Compare commits

...

334 Commits

Author SHA1 Message Date
08fb743598 chore: bump version 0.8.1 → 0.8.2
dogfood-discovered fixes (PR #145) land in production:
- schema.v1.repo_breakdown 가 실제로 채워짐 (이전: 항상 빈 BTreeMap)
- workspace.include glob 가 walker 에서 enforce 됨 (이전: 완전 무시)

patch bump 사유: 둘 다 advertised surface 의 정상 동작 복원.
새 wire / config / surface 변경 없음.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 05:20:48 +00:00
0a2a7ae214 Merge pull request 'fix(dogfood): schema.repo_breakdown + workspace.include walker enforcement (dogfood-discovered)' (#145) from fix/dogfood-bugs-schema-walker-incremental into main 2026-05-20 05:18:59 +00:00
803d02b68b fix(dogfood): enforce workspace.include in walker (allow-list semantics)
config.workspace.include was completely ignored by the walker — connector.rs
log_scope_include_warning literally said "handled by extractor router" but
no extractor router exists. Dogfooding (PR #142 1B + multi-root corpus
kebab-docs + httpx + zod + lodash) showed user-set include of code+md still
ingested 84 .png + 8 .pdf files.

Fix: walker treats scope.include as an allow-list — empty Vec preserves
backward-compat (all files pass), non-empty requires file path to match at
least one pattern (AND with the existing exclude rules). Removed the
misleading debug log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 05:15:04 +00:00
4e8b84c4e0 fix(dogfood): populate schema.v1.repo_breakdown (Task 9 follow-up)
Dogfooding (PR #142 1B + multi-root corpus: kebab-docs + httpx + zod + lodash)
revealed schema.v1.repo_breakdown is always {} despite the 1A-2 Task 9
having added the code_lang_breakdown sibling. The schema.rs:171 placeholder
`BTreeMap::new()` was left in place. Mirror Task 9's code_lang_breakdown
query for the repo field — same metadata_json JSON-path pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 05:09:19 +00:00
16dc02cfa2 chore: bump version 0.8.0 → 0.8.1
dogfood-discovered code_lang/repo filter bug (PR #144) fix lands in
production. patch bump because:
- 1A-1 advertised CLI flags --code-lang / --repo were live but inert
  (SearchFilters fields propagated but never applied to retriever SQL)
- fix restores intended behavior; no new wire surface
- user has dogfooded against httpx + zod + lodash and re-validating
  needs the fixed binary

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 03:35:36 +00:00
74f1b0571b Merge pull request 'fix(p10-1a-1): apply code_lang + repo filters in lexical SQL and filter_chunks (dogfood)' (#144) from fix/p10-1a-1-code-lang-repo-filter-sql into main 2026-05-20 03:34:53 +00:00
918ee6c0be fix(p10-1a-1): apply code_lang + repo filters in lexical SQL and filter_chunks (dogfood-discovered)
p10-1A-1 (PR #139) added SearchFilters.code_lang + .repo fields and the CLI
--code-lang / --repo flags propagate them correctly into SearchFilters, but
neither the lexical retriever's FTS SQL nor the shared filter_chunks helper
(used by the vector retriever) ever applied them — so a code-lang-filtered
search returned all-doc hits (markdown / pdf / code mixed).

Discovered while dogfooding p10-1B with httpx + zod + lodash clones:
`kebab search 'AsyncClient' --code-lang python --json` returned markdown
hits from httpx/docs/ first.

Fix: add IN-list filters on json_extract(d.metadata_json, '$.code_lang')
and '$.repo' to both lexical.rs and filters.rs, mirroring the existing
media filter pattern. Two regression tests added in each crate covering
the new filter behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 03:27:01 +00:00
68ada396f3 Merge pull request 'fix(p10-1b): apply round-1 lang.rs doc + tests/ test case missed in 4503b5b' (#143) from fix/p10-1b-lang-doc-test-staging-miss into main 2026-05-20 02:31:13 +00:00
23c4ad97b9 fix(p10-1b): apply round-1 lang.rs doc + tests/ test case missed in 4503b5b
PR #142 round-1 fix commit 4503b5b 보고에는 lang.rs 의 (a) module_path_for_python
doc comment 갱신 (tests/examples/benches 가 의도적으로 strip 안 됨 명시) 과
(b) tests/test_foo.py → tests.test_foo 단언 추가가 포함됐다고 적혔으나,
실제 commit 에는 lang.rs 변경이 staging 되지 않아 main 에 안 들어감 (review
loop round 2 이 working tree 상태만 신뢰하고 commit 검증을 안 함).

이번 PR 이 누락된 (5)+(6) 항목만 retro 적용. lang.rs +9 lines (test 1 +
doc 4 + 주석 2 + 빈줄 2). cargo test -p kebab-parse-code --lib → 20/20 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 02:28:53 +00:00
1f566b8bfa Merge pull request 'feat(p10-1B): Python + TS/JS AST chunkers — tree-sitter-{python,typescript,javascript} 코드 색인 활성화' (#142) from feat/p10-1b-py-ts-js into main 2026-05-20 02:26:24 +00:00
26562588e3 fix(p10-1b): PR review round 2 — fold TS class-method decorators into unit line range
Round 1 push-back on TS/JS class-method decorator handling was based on
an inaccurate doc comment in typescript.rs that claimed decorators are
method_definition children; tree-sitter-typescript 0.23 actually places
them as class_body preceding siblings. Round 2 correctly identified the
cross-language inconsistency with Python's decorated_definition arm.

Fix: extend unit_start backward walk in typescript.rs to also accept
'decorator' siblings (three-line change + corrected doc comment).
javascript.rs is unaffected: tree-sitter-javascript stores the decorator
as a named child INSIDE method_definition, so method_definition.start_row
already covers the decorator line without any sibling walk.

Adds three regression tests:
- class_method_decorator_folded_into_method_unit (TS): asserts @Log() is
  inside the emitted method unit code and line_start == 2.
- ts_class_decorator_folded_into_class_unit (TS): class-level @Injectable()
  folded into the class unit, line_start == 1.
- js_class_method_decorator_already_folded_by_grammar (JS): documents
  that JS already includes the decorator via grammar semantics.

verify: per-crate cargo test (20 passed) + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 02:20:22 +00:00
4503b5b12f fix(p10-1b): PR review round 1 — 5 actionable items
(1) tasks/HOTFIXES.md: add 2026-05-20 entry for path-sanitize gap in
    module_path_for_python / _tsjs (promised in task spec line 55 but
    not landed in round 0). Bidirectional cross-link added.

(2) crates/kebab-parse-code: dedup filename_from_workspace_path /
    strip_extension / join_symbol via new pub(crate) module scaffold.rs.
    Removed 9 byte-identical fn copies across rust/python/typescript/
    javascript extractors. Pure refactor — no behavior change.

(3) crates/kebab-parse-code/tests/fixtures/sample.py: @staticmethod was
    semantically inappropriate on a module-level fn (class-method
    decorator). Changed to @no_type_check; test assertion updated.

(5)+(6) crates/kebab-parse-code/src/lang.rs: add tests/test_foo.py case
    to module_path_for_python test + doc clarifying that tests/ /
    examples/ / benches/ are intentionally not stripped.

(4) PUSH BACK — TS/JS class decorator handling is design intent of 1B
    1차 (typescript.rs:242-244 + HOTFIXES entry 2 already in place).
    No code change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 02:03:52 +00:00
44813df052 docs(p10-1b): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX + HOTFIXES; chore: bump version 0.7.0 → 0.8.0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:48:06 +00:00
d6bb6cfd3b test(p10-1b): per-language chunker snapshots (python/ts/js)
Mirrors code_rust_ast_snapshot pattern. In-memory CanonicalDocument build so
no kebab-parse-code dep (boundary §6.3 respected).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:39:17 +00:00
d53995a6d4 feat(p10-1b): code-js-ast-v1 chunker + activate JavaScript in app dispatch
Chunker: duplicate-with-substitution from code-ts-ast-v1 / code-rust-ast-v1.
Dispatch: replaces JS bail! arms with JavascriptAstExtractor + CodeJsAstV1Chunker.
Integration test javascript_file_ingests_and_searches_as_code_citation asserts
citation.lang=javascript, symbol=src/Bar.Bar.baz, code_lang=javascript.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:16:07 +00:00
c215034653 feat(p10-1b): tree-sitter-javascript AST extractor (JS + JSX)
Single-grammar variant of typescript.rs — JS handles .jsx via the same
LanguageFn. No interface/type/enum arms; otherwise identical mapping +
workspace-path prefix via module_path_for_tsjs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:09:22 +00:00
31245a4328 fix(p10-1b): TS parser_version code-typescript-v1 → code-ts-v1 (naming consistency)
Task H implementer chose code-typescript-v1 but plan + design §3.3 use the
short form (chunker is code-ts-ast-v1 / code-js-ast-v1). Aligning parser
versions to match: rust=code-rust-v1 / python=code-python-v1 / ts=code-ts-v1
/ js=code-js-v1 (Task K). Fixes 2 sites: const PARSER_VERSION + integration
test assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:05:17 +00:00
acb61b6830 feat(p10-1b): activate TypeScript in ingest_one_code_asset dispatch
Replaces TS bail! arms with TypescriptAstExtractor + CodeTsAstV1Chunker.
Adds typescript_file_ingests_and_searches_as_code_citation integration test —
asserts citation.lang=typescript, symbol=src/Foo.Foo.bar, code_lang=typescript.
JS arms remain bail!() (Task L).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:59:41 +00:00
20feb3133e feat(p10-1b): code-ts-ast-v1 chunker (1:1 + oversize split)
Duplicate of code-rust-ast-v1 / code-python-ast-v1 with language-agnostic body unchanged.
Cross-chunker policy_hash identity asserted vs md-heading-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:56:41 +00:00
de63f161ac feat(p10-1b): tree-sitter-typescript AST extractor (TS + TSX via grammar selection)
Adds `kebab_parse_code::typescript::TypescriptAstExtractor` (PARSER_VERSION
`code-typescript-v1`), mirroring the Python extractor (P10-1B Task E) and
the Rust scaffold (P10-1A-2). One `Block::Code` per top-level AST semantic
unit (free fn / class / each method / interface / type alias / enum,
recursively per nested class), each carrying `SourceSpan::Code` with the
unit's dotted symbol path prefixed by `module_path_for_tsjs`.

Grammar selection per `tree-sitter-typescript` 0.23: the workspace path's
`.tsx` extension routes to `LANGUAGE_TSX`, everything else to
`LANGUAGE_TYPESCRIPT`. The `export_statement` arm unwraps a `declaration`
field (`function_declaration` / `class_declaration` / `interface_declaration`
/ `type_alias_declaration` / `enum_declaration`) using the OUTER statement's
line range so `export ` is folded in; for `export default function () {}`
and `export default class {}` (where the inner node sits under the `value`
field as `function_expression` / `class` with no `name`), the symbol leaf
is `default`. Bare value exports / re-exports fall into glue.

Glue grouping reuses the Python post-pass: `<module>` only when the entire
group is imports + bare re-exports; demoted to `<top-level>` if the file
produced any real unit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:54:27 +00:00
1815091247 feat(p10-1b): activate Python in ingest_one_code_asset dispatch
Replaces Python bail! arms with PythonAstExtractor + CodePythonAstV1Chunker.
Adds python_file_ingests_and_searches_as_code_citation integration test —
asserts citation.lang=python, symbol=kebab_eval.metrics.compute_mrr,
code_lang=python. TS/JS arms remain bail!() (Tasks J/L).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:49:01 +00:00
6a0b340941 feat(p10-1b): code-python-ast-v1 chunker (1:1 + oversize split)
Duplicate of code-rust-ast-v1 with language-agnostic body unchanged. Cross-chunker
policy_hash identity asserted vs md-heading-v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:46:17 +00:00
9664e97497 feat(p10-1b): tree-sitter-python AST extractor (PythonAstExtractor)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:41:35 +00:00
8bdb3e8090 refactor(p10-1b): generalize ingest_one_code_asset for multi-language dispatch
Rust path observably unchanged (verified by existing code_ingest_smoke tests).
Python/TS/JS arms bail with TODO; per-lang extractor + chunker land in subsequent tasks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:35:53 +00:00
dcad9ccda2 feat(p10-1b): module_path_for_python / _tsjs helpers (workspace path → module prefix)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:31:33 +00:00
ed0f4769b3 feat(p10-1b): route .py/.pyi/.ts/.tsx/.js/.mjs/.cjs/.jsx to MediaType::Code
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:30:07 +00:00
0c61758931 build(p10-1b): add tree-sitter-python/-typescript/-javascript workspace deps
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:28:31 +00:00
39b766ea59 docs(p10-1b): task spec + implementation plan
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:26:58 +00:00
7f287abacb Merge pull request 'test(eval): normalize elapsed_ms before determinism comparison (flake fix)' (#141) from fix/eval-runner-timing-flake into main 2026-05-20 00:08:40 +00:00
d715631928 test(eval): normalize elapsed_ms before determinism comparison (flake fix)
`runner_lexical_is_deterministic_per_query_payload` 가 full-suite 첫 실행에서
간헐적으로 `elapsed_ms: 0` vs `elapsed_ms: 1` 차이로 깨지는 timing flake 가
있었음 (PR #140 회차 0 의 full-suite 실행에서 관찰).

원인: per_query 전체 JSON 을 byte-identical 비교하는데 QueryResult.elapsed_ms
가 timing 기반이라 µs-scale wall-clock jitter 가 그대로 비교에 들어감. 의도는
"timing 외에 byte-identical" — 인접 snapshot test #7 은 projection 으로
timing 을 명시적으로 제외하지만 #6 은 누락.

Fix: 비교 직전 양쪽 run 의 elapsed_ms 를 0 으로 normalize. 의도 그대로
표현하고 다른 field 의 결정성 검증은 보존. 50회 반복 stress 통과 (이전:
간헐 실패).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:01:41 +00:00
73e5b359d8 Merge pull request 'feat(p10-1A-2): Rust AST chunker — tree-sitter-rust 코드 색인 활성화' (#140) from feat/p10-1a-2-rust-ast-chunker into main 2026-05-19 23:40:15 +00:00
c780aca904 fix(p10-1a-2): PR review round 2 — README wire fields + SMOKE config completeness + edge-case note + gitignore dedup
PR #140 회차 2 actionable 4건:
- README.md: `citation.kind = "code"` 행에서 wire 필드 구조 정정 — citation 안에는 `lang`, SearchHit top-level 에는 `code_lang`/`repo` (round 1 SMOKE 정정과 동일 클래스)
- docs/SMOKE.md: 격리 config 블록에 `extra_skip_globs = []` 추가 (P10 섹션의 "위 격리 config 블록 참조" 와 정합)
- crates/kebab-parse-code/src/rust.rs: comment-only 파일 → 0 blocks 동작을 module doc 에 한 줄 명시 (pdf-page-v1 의 "empty page produces no chunks" 패턴과 동일)
- .gitignore: `/target/` 제거 — `/target` (no trailing slash) 이 디렉토리 + 파일 + 심링크 모두 매칭하므로 `/target/` (dir 전용) 는 redundant

verify: `cargo check -p kebab-parse-code` clean (주석/문서 외 영향 없음).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:35:00 +00:00
b1d5047399 fix(p10-1a-2): PR review round 1 — doc inconsistencies + observable backfill error path
PR #140 회차 1 actionable 7건 반영:
- docs/SMOKE.md: parser_version "code-rust-ast-v1" → "code-rust-v1" (chunker_version 과 혼동); jq path .citation.code_lang → .citation.lang (wire 의 code_lang 은 SearchHit top-level)
- docs/ARCHITECTURE.md: Mermaid pcode→ptypes 잘못된 edge → pcode→core 로 정정 (kebab-parse-code Cargo.toml 실제 dep 와 일치); 디렉토리 트리에서 code-rust-ast-v1 chunker 표기 위치 kebab-parse-code → kebab-chunk 로 정정
- crates/kebab-app/src/app.rs: backfill_repo 의 .ok().flatten() 실패 silent swallow → tracing::warn 로 관측 가능, 비-abort 의도 보존
- crates/kebab-parse-code/src/rust.rs: impl_item arm 의 "function_item 만 unit 생성" 1A scope 한정 주석을 외부에서도 보이도록 arm 상단에 한 줄 추가 (내부 주석은 유지)

verify: kebab-parse-code 7/7 / kebab-app --lib 51/51 / code_ingest_smoke 3/3 green; touched-crate clippy clean (재부팅 전 검증).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:24:20 +00:00
80c2d31fb3 docs(p10-1a-2): README/HANDOFF/ARCHITECTURE/SMOKE/INDEX + HOTFIXES; chore: bump version 0.6.0 → 0.7.0
- README: note Rust .rs ingest active (code-rust-ast-v1), update Mermaid parse node + chunker labels, update supported formats note in Quick start and ingest command table; add code citation fields (symbol, code_lang, repo) and filter flags note
- HANDOFF: flip P10 row to note 1A-1  + 1A-2 PR open; add one-liner cross-link to HOTFIXES 2026-05-19 entries
- ARCHITECTURE: add kebab-parse-code node + edge (app → pcode, pcode → ptypes) to Mermaid graph; add directory tree entry; add code parser locked-in decision row (tree-sitter lives parser-side, design §6.3)
- SMOKE: add P10-1A-2 Rust code ingest section (ingest.code config keys, verification steps, known behaviors); add checklist item
- tasks/INDEX.md: flip p10-1A-1 to , update p10-1A-2 to 🟡 PR open
- tasks/p10/INDEX.md: same flips
- tasks/HOTFIXES.md: add two 2026-05-19 dated entries (AST_CHUNK_MAX_LINES constant vs config deviation + SourceType::Code deferred)
- tasks/p10/p10-1a-2-rust-ast-chunker.md: append two HOTFIXES cross-link lines in Risks/notes
- docs/superpowers/specs/2026-04-27-kebab-final-form-design.md §10.1: note p10-1A-2 surface activation
- Cargo.toml: version 0.6.0 → 0.7.0 (dogfooding-ready = minor bump trigger per CLAUDE.md)
- Cargo.lock: regenerated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:48:11 +00:00
97e9f558f4 test(p10-1a-2): code-rust-ast-v1 chunker snapshot + full-suite gate
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:14:57 +00:00
da51e59081 feat(p10-1a-2): populate schema.v1 code_lang_breakdown
Add `SqliteStore::code_lang_breakdown()` that queries
`json_extract(metadata_json, '$.code_lang')`, groups by it, and
skips NULL rows — returning `BTreeMap<String, u32>`.

Wire it into `collect_stats` in `kebab-app::schema`, replacing the
`BTreeMap::new()` placeholder inserted by 1A-1.

Test: `store::tests::code_lang_breakdown_counts_by_code_lang` asserts
rust=1 and that a null-code_lang doc does NOT appear in the map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:41:52 +00:00
11a0fc758f docs(p10-1a-2): note backfill invariant at search_with_opts non-trace path (Task 8 review)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:20:13 +00:00
b5d1fe8c1e feat(p10-1a-2): backfill SearchHit.repo from doc metadata (Task 8b)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:13:01 +00:00
580576c2c6 feat(p10-1a-2): wire code ingest dispatch (ingest_one_code_asset)
Add `MediaType::Code("rust")` dispatch arm in `ingest_one_asset`,
`ingest_one_code_asset` fn (faithful mirror of `ingest_one_pdf_asset`),
and `backfill_code_lang` post-processing in `App::search_uncached`.
Integration test `code_ingest_smoke.rs` verifies full pipeline:
ingest `.rs` → Citation::Code hit with lang/symbol/line_start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 20:14:59 +00:00
808b92a6c5 feat(p10-1a-2): code-rust-ast-v1 chunker (1:1 + oversize split)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:40:11 +00:00
c74f8d269e chore(p10-1a-2): sync Cargo.lock for kebab-parse-code deps (Task 6 follow-up)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:36:54 +00:00
df85bafa7f fix(p10-1a-2): module-prefix glue symbols + crate desc + invariant hardening (Task 6 review)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:35:52 +00:00
a93b33ffbe fix(p10-1a-2): correct <module> label scope + de-dup leading attribute (Task 6 review)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:28:49 +00:00
402a4506a2 feat(p10-1a-2): tree-sitter-rust AST extractor (parser-side)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:22:09 +00:00
a531dc37dc feat(p10-1a-2): route .rs files to MediaType::Code(rust)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:17:36 +00:00
7a6a24ad10 feat(p10-1a-2): add MediaType::Code(lang) variant
TDD: red → green cycle confirmed. New `Code(String)` variant serializes
as `{"code":"rust"}` via serde `rename_all = "lowercase"`. All exhaustive
`match` sites updated (`media_label`, `ingest_one_asset` catch-all →
explicit or-pattern). Design §3.5 enum listing synced. Also fix
`/target` symlink gitignore pattern so integration-test binary lookup
via workspace-relative path works with CARGO_TARGET_DIR redirect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:14:45 +00:00
42712b50c2 feat(p10-1a-2): map SourceSpan::Code -> Citation::Code in citation_helper
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 15:59:03 +00:00
9f3edb7e24 feat(p10-1a-2): add internal SourceSpan::Code variant + design §3.4 sync
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 15:52:01 +00:00
5c265bb59f build(p10-1a-2): add tree-sitter + tree-sitter-rust workspace deps
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 15:38:19 +00:00
a08ed32199 docs(p10-1a-2): task spec + implementation plan
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 15:36:08 +00:00
9362cd0aae Merge pull request 'feat(p10-1a-1): code ingest framework — wire schema + parse-code crate + filter flags' (#139) from spec/p10-code-ingest-design into main
Reviewed-on: #139
2026-05-15 09:31:26 +00:00
th-kim0823
7961f8813d fix(p10-1a-1): PR review round 1 — doc inconsistencies
회차 1 review 의 4 건 actionable 모두 반영:

1. frozen design §2.1 의 code variant 예시에서 존재하지 않는 `repo` 필드 제거 + nested form 에서 actual wire (flat) 형태로 정리. 5 variant 의 nested-form illustrative example 은 그대로 두고, code variant 만 별도 block 으로 분리해서 actual wire 와 1:1 매칭. 또 위쪽 6 variant nested-form group 에서도 'code' 행 삭제 (정확한 contract 는 별도 block 에 있음).
2. §2.2 SearchHit 예시의 `repo: null, code_lang: null` + 'omitted when null' 주석 모순 제거 — 키 자체를 빼고 inline 주석으로 'markdown hit 에는 absent, 코드 hit 에서만 surface' 설명.
3. HANDOFF Phase row 식별자 `**10**` → `**P10**` (다른 row 와 일관성).
4. README synopsis 의 중복 `[--media code]` 제거 (`--media` 는 이미 위쪽에 한 번 있음, code 는 값 중 하나라 prose 에서 설명).

코드 변경 없음 — 모두 markdown 문서.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 18:24:15 +09:00
th-kim0823
7bbd2c0cbf docs(p10-1a-1): wire schema + frozen design + README/HANDOFF/SMOKE + task index
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 17:41:26 +09:00
th-kim0823
d13f58d28a fix(p10-1a-1): patch wire.rs Stats fixture for new schema fields
Task 16's new code_lang_breakdown / repo_breakdown fields broke the existing schema_wrapper_tags_schema_version test in wire.rs which constructs Stats { ... } literally. Use ..Default::default() since Stats now derives Default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 17:30:01 +09:00
th-kim0823
298f4adc81 feat(p10-1a-1): CLI filter flags + SchemaStats breakdowns + regression tests
Task 13: add wire regression tests proving markdown SearchHit omits
repo/code_lang when None, and all 5 original Citation variants serialize
byte-identically without spurious Code-variant keys.

Task 15: add --repo (repeatable) and --code-lang (repeatable,
comma-separated) flags to `kebab search`; propagate both into
SearchFilters instead of the previous vec![] stub. Add
#[allow(clippy::large_enum_variant)] — Cmd is short-lived, boxing buys
nothing.

Task 16: add code_lang_breakdown and repo_breakdown BTreeMap fields to
Stats (schema.v1); derive Default on Stats; populate both as empty in
collect_stats (1A-2 fills them when code chunks land). Add unit test
asserting both keys are present in the serialized object.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 17:21:59 +09:00
th-kim0823
4e8b70a04b feat(p10-1a-1): apply generated-header + size-cap skip per file
Wire kebab_parse_code::is_generated_file and is_oversized into
FsSourceConnector::scan_with_skips. Files that pass gitignore/builtin/
kebabignore matching are now checked for generated-file markers
(config-gated via ingest.code.skip_generated_header) and byte/line caps
(ingest.code.max_file_bytes / max_file_lines). FsScanSkips gains
skipped_generated + skipped_size_exceeded counters; kebab-app threads
them into IngestReport. Also fixes a pre-existing clippy::derivable_impls
warning in IngestCfg. Three new connector tests cover all three paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 17:06:59 +09:00
th-kim0823
682f7dd3a2 feat(p10-1a-1): add [ingest.code] config section
Add IngestCfg + IngestCodeCfg structs with serde defaults and embed
ingest: IngestCfg into the top-level Config. Existing configs without
an [ingest] section continue to load unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:53:21 +09:00
th-kim0823
40b3ea8408 chore(p10-1a-1): cleanup Task 11 review findings + sync Cargo.lock
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:50:55 +09:00
th-kim0823
9fce24b106 feat(p10-1a-1): wire IngestReport skip counters by category (gitignore/builtin/kebabignore)
Refactor walker to expose WalkOverrides (combined + per-source matchers),
add walk_files_with_skips that returns accepted files alongside skip
attribution, wire FsSourceConnector::scan_with_skips into kebab-app so
IngestReport.skipped_gitignore, skipped_kebabignore, skipped_builtin_blacklist,
and skip_examples are populated instead of left at zero. Priority order
per spec §5.2 (builtin > gitignore > kebabignore) enforced in classify_skip,
with a directory-aware builtin matcher so pruned directory entries are
correctly attributed to builtin rather than a coincident gitignore entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:42:28 +09:00
th-kim0823
8bbe25dc10 fix(p10-1a-1): guard .gitignore negation + sync doc comments
Prevent double-`!` corruption when a `.gitignore` negation pattern
(e.g. `!keep/`) hits the trailing-slash normalizer in `read_gitignore`.
Also updates module-level and `build_overrides` doc to list all five
filter sources in application order, and adds a regression test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:30:00 +09:00
th-kim0823
abfdcbd31d feat(p10-1a-1): honor repo-root .gitignore in walker overrides
Adds read_gitignore() (pub(crate), root-only, nested cascade deferred)
and merges its patterns as a 5th group in build_overrides(). Trailing-
slash patterns (dist/) are normalized to also emit a stem/** glob so
files inside the directory are matched when is_dir=false. Two new tests
cover both the happy path and the missing-file no-op.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:25:19 +09:00
th-kim0823
69d1593bc5 feat(p10-1a-1): integrate built-in blacklist into walker overrides
Wires `kebab_parse_code::BUILTIN_BLACKLIST` (6 patterns: node_modules,
target, __pycache__, .venv, venv, env) into `build_overrides()` so the
walker automatically excludes these directories even when the user has
no `.kebabignore`. TDD cycle: 2 failing tests added first, then the
pattern-add loop inserted after the existing kbignore block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:13:39 +09:00
th-kim0823
2a8451c033 fix(p10-1a-1): tighten kebab-parse-code manifest + tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:05:34 +09:00
th-kim0823
ff11f81f7f feat(p10-1a-1): kebab-parse-code crate (lang + repo + skip)
Tasks 5-8: new `kebab-parse-code` crate with three infrastructure modules
for the code ingest framework. Ships lang.rs (extension→language identifier
mapping), repo.rs (.git walk-up via gix 0.70 for RepoMeta), and skip.rs
(BUILTIN_BLACKLIST, is_generated_file, is_oversized). 14 integration tests
across three test files, all passing; clippy -D warnings clean.

Note: gix pinned to 0.70 (not 0.83 as originally suggested) because 0.83
fails to compile against Rust 1.94.1 due to non-exhaustive match patterns
in gix-hash. 0.70 resolves cleanly and has identical head_name/head_id API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 15:57:59 +09:00
th-kim0823
bf4ebf8d2a feat(p10-1a-1): add Metadata.repo / git_branch / git_commit / code_lang
Four optional, serde-skipped-when-None fields added to `Metadata` for
code ingest context. All 11 downstream construction sites patched with
`repo: None, git_branch: None, git_commit: None, code_lang: None`.
Full workspace check (`--tests`) and per-crate test suite pass clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 15:44:18 +09:00
th-kim0823
351c7a0826 feat(p10-1a-1): add IngestReport skip counters + SkipExamples
Adds five new u32 counters (skipped_gitignore, skipped_kebabignore,
skipped_builtin_blacklist, skipped_generated, skipped_size_exceeded)
and a SkipExamples struct (≤5 sample paths per category) to
IngestReport. All new fields are #[serde(default)] for backward-compat
deserialization. Downstream literal construction sites patched with
zeros/empty; snapshot re-baked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 15:28:19 +09:00
th-kim0823
7329ba96ee fix(p10-1a-1): patch missed SearchHit test-only construction sites
Add repo: None, code_lang: None to the 3 SearchHit struct literals
inside #[cfg(test)] blocks that were missed by the fa4eeb5 sweep.
2026-05-15 15:17:10 +09:00
th-kim0823
fa4eeb5a87 feat(p10-1a-1): add SearchHit.repo / code_lang + SearchFilters.repo / code_lang
Wire two new optional fields onto SearchHit (skip_serializing_if = None)
and two Vec<String> filter fields onto SearchFilters (serde default).
Add RetrievalDetail::Default impl (manual, uses SearchMode::Hybrid as
sentinel). Patch all downstream SearchHit / SearchFilters literal
constructors with repo: None / code_lang: None / vec![] as appropriate.
Also covers Citation::Code arm in kebab-eval metrics match.
2026-05-15 15:04:23 +09:00
th-kim0823
3b1e878aed feat(p10-1a-1): add Citation::Code variant
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 14:39:18 +09:00
th-kim0823
005a9011ea plan(p10-1a-1): code ingest framework implementation plan + spec wire-shape fix
21 task plan: kebab-core 도메인 타입 (Citation::Code variant, SearchHit repo/code_lang, IngestReport skip counters, Metadata extension), 새 kebab-parse-code crate (lang/repo/skip 모듈, gix dep), kebab-source-fs gitignore+blacklist 통합, kebab-config [ingest.code] 절, kebab-cli --repo/--code-lang flag, wire schema JSON 갱신, frozen design doc 갱신, README/HANDOFF/SMOKE 갱신, task index. 각 task 가 5-step TDD cycle (test fail → impl → pass → commit). 코드 chunker 는 1A-1 에 없음 — 1A-2 에서 추가.

spec 의 Citation::Code 예시가 기존 5 variants 의 flat wire 형태와 안 맞아서 (`code: {...}` 중첩이 아니라 top-level field) 같이 fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 14:31:22 +09:00
th-kim0823
c6d61b0b37 spec(p10): split Phase 1A into 1A-1 (framework) and 1A-2 (Rust chunker)
1A 가 들고 들어가는 *프레임워크 surface* (Citation `code` variant, SearchHit repo/code_lang, --media code / --code-lang / --repo filter, skip 정책, IngestReport 세분화, config 절, kebab-parse-code crate skeleton) 가 *언어 chunker 자체* 와 독립 검증 가능 — 1A-1 머지 후 기존 markdown corpus 의 wire 출력이 byte-level identical 한지 regression test 로 검증한 다음 1A-2 에서 Rust AST chunker 자체에 집중. binary version bump 트리거도 1A-2 로 미룸 (1A-1 은 wire additive minor + 사용자 surface 변경 없음).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 14:20:10 +09:00
th-kim0823
49487dc46b spec(p10): code ingest design — Tier 1 AST + Tier 2 resource + Tier 3 fallback
수십 개 git repo (한 부모 dir 아래) 를 corpus 로 확장. Tier 1 (Rust/Python/TS-JS/Go/Java/Kotlin/C/C++) 은 tree-sitter AST per-language chunker, Tier 2 (k8s manifest / Dockerfile / Cargo.toml 류) 는 resource-aware chunker, Tier 3 (shell / fallback) 는 paragraph + line-window. embedding 은 multilingual-e5-large 유지 — cross-corpus 검색 위해. Phase 1A (Rust) 부터 1D (C/C++) + Phase 2 (Tier 2) + Phase 3 (Tier 3) 순으로 진행. ignore 통합 (.gitignore honor + .kebabignore 추가 + 최소 built-in safety net), generated header sniff, size cap 으로 첫 도그푸딩 비용 차단. 새 Citation variant `code`, SearchHit 의 repo/code_lang 필드, --media code / --code-lang / --repo filter — 모두 additive minor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 14:15:59 +09:00
2c2bf9bac5 Merge pull request 'docs(claude): cargo clean routinely between merges' (#135) from chore/cargo-clean-cadence into main
Reviewed-on: #135
2026-05-10 15:02:00 +00:00
72798bd3ff Merge pull request 'chore: bump version 0.5 → 0.6' (#138) from chore/bump-v0.6.0 into main
Reviewed-on: #138
2026-05-10 15:01:45 +00:00
th-kim0823
c3177561b9 chore: bump version 0.5 → 0.6
v0.6.0 batches RAG quality batch:
- fb-38 score semantics (search_hit.v1 score_kind)
- fb-40 fact-grounded answer (rag-v2 prompt template)
- fb-42 bulk multi-query (kebab search --bulk + mcp__kebab__bulk_search)
- fb-39 eval foundation (precision_at_k_chunk metric)
- fb-39b embedding upgrade (multilingual-e5-large default)

embedding_version cascade triggers minor bump per design §9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 23:56:51 +09:00
a465b71f99 Merge pull request 'feat(fb-39b): embedding upgrade — multilingual-e5-large default' (#137) from feat/fb-39b-embedding-upgrade into main
Reviewed-on: #137
2026-05-10 14:53:21 +00:00
th-kim0823
787007172a fix(fb-39b): address PR #137 round 2 review
- target_version 0.7.0 → 0.6.0 (current Cargo.toml = 0.5.0;
  embedding_version cascade bumps to 0.6, not 0.7)
- 요약 bullet "0.6 → 0.7" → "0.5 → 0.6" 정정

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 23:47:47 +09:00
th-kim0823
b954e9ce66 fix(fb-39b): address PR #137 round 1 review
- CI-only embed_model.rs tests updated 384 → 1024 + e5-small → e5-large
  references (incl. file header download size, snapshot dim assert,
  L2 norm comment)
- kebab-embed-local module docs + Cargo.toml description list both
  models (small + large)
- Stale tracing message expanded with both model sizes
- Task spec Post-merge deviation section: record dropped
  embedding_dim_mismatch ErrorV1 + reason (LanceDB (model, dim)
  namespacing makes hard-error redundant)
- Task spec + HOTFIXES version bump 0.6→0.7 corrected to 0.5→0.6
  (current Cargo.toml = 0.5.0; fb-42 0.6 cut deferred per user
  direction)
- HOTFIXES "embedding_version bump 아님" line corrected — cascade rule
  DOES trigger release bump, plus deviation note for the dropped error

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 23:45:55 +09:00
th-kim0823
c62a8ff503 docs(fb-39b): design + HOTFIXES + new task spec + INDEX + README + SMOKE
Tasks 4 + 5: comprehensive doc update for embedding upgrade (multilingual-e5-large).

- design §5 + §9: update embedding_model / dimensions references (384 -> 1024)
- HOTFIXES: add fb-39b entry with user re-ingest procedure + backwards-compat notes
- tasks/p9-fb-39b-embedding-upgrade.md: new task spec (completed status)
- INDEX.md: add fb-39b row under RAG quality phase
- fb-39 task banner: append fb-39b link as lever implementation
- README: update config defaults + fastembed model size + embedding field docs
- SMOKE.md: append embedding upgrade verification section with e5-small -> e5-large sequence

Wire schema: no change (additive at config level, new table created by existing code).
Binary version: 0.6.0 -> 0.7.0 (cascade rule: embedding_model change = minor bump).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 23:28:48 +09:00
th-kim0823
69c94b6692 feat(embed,config): add multilingual-e5-large + flip default config (fb-39b)
Task 1: Add multilingual-e5-large arm to kebab-embed-local::resolve_model with tests for 1024-dim variants and error cases.

Task 2: Flip kebab-config defaults from e5-small (384-dim) to e5-large (1024-dim) across defaults(), test assertions, and TOML template.

All tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 23:05:36 +09:00
th-kim0823
d5321701ea plan(fb-39b): embedding upgrade implementation plan
5 tasks: kebab-embed-local resolve_model arm + check_dim test,
kebab-config defaults + TOML template flip, cross-crate fixture
sweep (likely no-op since most tests use provider=none), docs
(design + HOTFIXES + new task spec + INDEX), README + SMOKE
walkthrough.

Post-merge: 0.6 → 0.7 binary bump per CLAUDE.md cascade rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 23:02:37 +09:00
th-kim0823
2c3461c465 spec(fb-39b): embedding model upgrade design
- multilingual-e5-small (384 dim) → multilingual-e5-large (1024 dim)
- Cascade: embedding_version bump → fb-23 incremental ingest
  re-embeds all chunks
- Migration policy: dim mismatch detection at LanceVectorStore::open
  → error.v1 (code = embedding_dim_mismatch) + hint
  "kebab reset --vector-only && kebab ingest"
- Config defaults flip (model + dimensions). User TOML pinning small
  preserves backwards-compat
- bge-m3 deferred (fastembed enum 미포함, UserDefinedEmbeddingModel
  ONNX path 별도)
- Release trigger: 0.6 → 0.7 minor bump per CLAUDE.md cascade rule

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 22:59:03 +09:00
240120ee80 Merge pull request 'feat(fb-39): eval foundation — precision_at_k_chunk metric' (#136) from feat/fb-39-eval-foundation into main
Reviewed-on: #136
2026-05-10 13:41:04 +00:00
th-kim0823
5870a1de15 fix(fb-39): address PR #136 round 1 review
kebab eval compare now surfaces precision_at_k_chunk delta in both
human-readable table + deltas JSON. Snapshot fixture regenerated
additively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 22:39:11 +09:00
th-kim0823
f00fb376fe docs(fb-39): golden header + design §10.3 eval + spec status + INDEX
Strengthen fixtures/golden_queries.yaml header with precision_at_k_chunk
explanation + measurement guidance. Add §10.3 Eval metrics section to
frozen design documenting retrieval metrics (hit@k, MRR, recall@k_doc,
P@k_chunk) + groundedness metrics. Flip p9-fb-39 spec status from open
→ completed (eval foundation only, lever deferral noted). Update
tasks/INDEX.md fb-39 row mirror to fb-42 (merged, deferred note).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 22:35:15 +09:00
th-kim0823
bb0ec0469f feat(eval): precision_at_k_chunk metric (P@5, P@10) (fb-39) 2026-05-10 22:26:21 +09:00
th-kim0823
f303c76f52 plan(fb-39): eval foundation implementation plan
4 tasks: AggregateMetrics.precision_at_k_chunk field + serde
backwards-compat, compute aggregation in loop with 5 unit tests,
golden YAML header doc strengthening, design §11 + INDEX + status
flip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 22:19:44 +09:00
th-kim0823
cd5b1e3bfc spec(fb-39): eval foundation design (P@k metric)
- AggregateMetrics 에 precision_at_k_chunk: BTreeMap<u32, f32>
  (P@5, P@10) 추가, binary relevance via expected_chunk_ids
- Denominator = k 고정 (hits.len() < k 도 precision 손실 간주)
- Empty expected_chunk_ids query 는 skip (hit_at_k 동일 정책)
- Lever 적용 (chunk policy / RRF / cross-encoder / embedding) 은
  본 spec 범위 외 — fb-39b 이후 별도 task
- Golden set schema 무변경, shipped fixtures 헤더 주석만 강화

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 22:05:09 +09:00
th-kim0823
7c6c2e8102 docs(claude): cargo clean routinely between merges
target/ balloons to 90+ GB after a few task cycles (fb-* batches
accumulate). User reported disk full mid-session twice — strengthen
guidance from "if pressure shows up" to "routinely after each merged
PR".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:48:43 +09:00
3a9a52326d Merge pull request 'feat(fb-42): bulk multi-query — kebab search --bulk + mcp__kebab__bulk_search' (#134) from feat/fb-42-bulk-multi-query into main
Reviewed-on: #134
2026-05-10 12:27:11 +00:00
th-kim0823
b53376e96e fix(fb-42): address PR #134 round 1 review
- print_schema_text plain mode: include bulk_search capability row
- README: tool count 7 → 8, fetch added to MCP tool name lists

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:19:20 +09:00
th-kim0823
441f1192ee docs(fb-42): wire schema + README + SMOKE + design + SKILL + INDEX
- Add bulk_search_item.v1 + bulk_search_response.v1 wire schemas
- Register both in WIRE_SCHEMAS const
- README: --bulk flag mention + MCP tool list 7→8 (bulk_search)
- SMOKE: bulk multi-query walkthrough (CLI + MCP equivalent)
- Design §2.2: Bulk multi-query (fb-42) subsection (additive minor)
- SKILL: mcp__kebab__bulk_search section + tool table row
- Task spec status open→completed, banner replaced
- INDEX: fb-42 row 머지 (rerank hint deferred)
- Fix: missed Capabilities {bulk_search} in cli wire.rs test (Task 7 leftover)
- Fix: missed tools.len() 7→8 in cli_mcp_smoke (Task 5 leftover)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:07:36 +09:00
th-kim0823
e8da415624 feat(schema): bulk_search capability flag (fb-42)
- Capabilities.bulk_search: true (snapshot)
- schema.v1 wire required list updated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:49:09 +09:00
th-kim0823
d8e5f35601 test(mcp): integration tests for bulk_search tool (fb-42) 2026-05-10 20:33:32 +09:00
th-kim0823
6ab0d782ef feat(mcp): kebab__bulk_search tool (fb-42)
Exposes bulk multi-query search via MCP `bulk_search` tool:
- Input: { queries: [SearchInput shapes...] }, capped at 100
- Output: bulk_search_response.v1 with per-query results + summary
- Sequential execution reuses App instance for cache amortization
- Per-query errors embed error.v1 JSON; never aborts bulk call

Updates tool count from 7 to 8 in lib.rs comment + tools_list test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:31:20 +09:00
th-kim0823
2bbe94eb05 test(cli): integration tests for kebab search --bulk (fb-42) 2026-05-10 20:26:07 +09:00
th-kim0823
9ac13fa256 fix(cli): make query optional when --bulk is set (fb-42) 2026-05-10 20:26:03 +09:00
th-kim0823
67f2c16cc2 feat(cli): kebab search --bulk flag + stdin ndjson + output stream (fb-42) 2026-05-10 20:22:45 +09:00
th-kim0823
1ebbd6b711 feat(app): bulk_search_with_config facade (fb-42) 2026-05-10 20:18:49 +09:00
th-kim0823
892175d009 feat(core): BulkSearchItem / Summary / Response types (fb-42) 2026-05-10 20:12:31 +09:00
th-kim0823
de9016fe16 plan(fb-42): bulk multi-query implementation plan
8 tasks: kebab-core types, kebab-app bulk_search_with_config facade
(cap 100 + per-query error policy), CLI --bulk flag + stdin ndjson +
output stream, CLI integration tests, MCP bulk_search tool +
registration + tools_list count bump, MCP integration tests,
capability flag, wire schemas + README + SMOKE + design + SKILL +
status flip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:10:39 +09:00
th-kim0823
35df15df99 spec(fb-42): bulk multi-query design (rerank hint deferred)
- CLI: kebab search --bulk + stdin ndjson → stdout per-query ndjson
- MCP: 신규 kebab__bulk_search tool + JSON envelope (results + summary)
- Sequential for-loop, App instance 재사용 (cache amortize)
- Per-query error policy: continue + per-item error.v1
- Limits: queries.len() <= 100
- Capability flag bulk_search 신규
- Rerank hint 별도 task (fb-39 cross-encoder 설계 후)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:05:27 +09:00
b0becf43b8 Merge pull request 'chore(handoff): sync release roadmap with shipped state' (#133) from chore/sync-handoff into main
Reviewed-on: #133
2026-05-10 10:49:23 +00:00
21ecbb00d4 Merge pull request 'feat(fb-40): fact-grounded answer — rag-v2 prompt template' (#132) from feat/fb-40-fact-grounded-answer into main
Reviewed-on: #132
2026-05-10 10:49:06 +00:00
th-kim0823
8cd21e8342 chore(handoff): sync release roadmap with shipped state
- 0.3.0 batch (fb-26/27/28 + fb-29 deferral) marked cut
- 0.4.0 batch (fb-30 MCP + fb-31 single-file) marked cut
- 0.5.0 batch (fb-32..37) marked cut on 2026-05-10
- 0.6.0 in progress: fb-38 + fb-40 merged today, fb-39 pending
- fb-41/42 reframed as 0.7.0+ candidates

Note: PR #132 (fb-40) merge updates roadmap header in spec status
table (already flipped via fb-40 PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 19:46:28 +09:00
th-kim0823
b35f163f56 fix(fb-40): address PR #132 round 1 review
Module doc still pinned "rag-v1" — update to reflect dispatched
template via system_prompt_for (rag-v1 legacy / rag-v2 default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 19:42:57 +09:00
th-kim0823
600c6182fc docs(fb-40): rag-v2 prompt + README + design + SKILL + INDEX
- README: [rag] prompt_template_version default rag-v2 + V2 강화 3 규칙
- design §7: rag-v2 본문 + V1 legacy note
- SKILL.md: mcp__kebab__ask 응답 행태 변화 안내
- task spec: status open → completed, design + plan 링크
- INDEX: fb-40  머지 (2026-05-10)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 19:37:28 +09:00
th-kim0823
0e8b800b6b test(rag): integration tests for rag-v1/v2/unknown dispatch (fb-40)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 19:18:36 +09:00
th-kim0823
126559ce7a fix(fb-40): update test fixtures for rag-v2 default 2026-05-10 19:15:15 +09:00
th-kim0823
137fc4ee31 feat(config): default prompt_template_version rag-v1 → rag-v2 (fb-40) 2026-05-10 19:04:55 +09:00
th-kim0823
59f01f8185 feat(rag): pipeline reads prompt_template_version via helper (fb-40) 2026-05-10 19:02:39 +09:00
th-kim0823
9f70681b77 feat(rag): SYSTEM_PROMPT_RAG_V2 + system_prompt_for dispatch helper (fb-40) 2026-05-10 19:01:05 +09:00
th-kim0823
6d6eb442be plan(fb-40): fact-grounded answer implementation plan
6 tasks: SYSTEM_PROMPT_RAG_V2 + system_prompt_for helper, pipeline
dispatch wiring, config default flip rag-v1 → rag-v2, test fixture
cleanup, integration tests (rag-v1 / rag-v2 / unknown via
CapturingLm wrapper around MockLanguageModel), docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 18:58:35 +09:00
th-kim0823
28d3250546 spec(fb-40): fact-grounded answer design
- rag-v1 → rag-v2 system prompt with 3 신규 규칙 (verbatim span 인용 자도 /
  학습 지식 동원 금지 / 추측 금지)
- system_prompt_for(version) helper dispatch in pipeline
- config default prompt_template_version "rag-v1" → "rag-v2", V1 legacy
  kept for backwards-compat
- Lever C (pre-LLM gate) already shipped (RefusalReason::ScoreGate),
  out of scope here

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 18:55:05 +09:00
945319ae93 Merge pull request 'feat(fb-38): score semantics — score_kind on search_hit.v1 + RRF formula docs' (#131) from feat/fb-38-score-semantics into main
Reviewed-on: #131
2026-05-10 09:38:24 +00:00
th-kim0823
c864bd007f docs(fb-38): wire schema + README + design + SKILL + INDEX 2026-05-10 18:21:55 +09:00
th-kim0823
67aee9f480 test(cli): integration tests for score_kind on lexical mode (fb-38) 2026-05-10 18:12:14 +09:00
th-kim0823
4440fa6659 fix(fb-38): add score_kind to remaining SearchHit literals
Add missing score_kind field to SearchHit constructors in:
- kebab-tui/tests/search.rs::make_hit()
- kebab-eval/tests/metrics_and_compare.rs::hit()
- kebab-eval/src/metrics.rs::hit()

All test fixtures default to Rrf (hybrid mode), matching the field's
Default impl and the test semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 18:08:29 +09:00
th-kim0823
b51cdb9e8f feat(search/hybrid): fuse hits override score_kind to Rrf (fb-38) 2026-05-10 17:56:56 +09:00
th-kim0823
4e739f3cd8 feat(search): add score_kind to VectorRetriever (Cosine) and hybrid test helpers (Rrf)
This commit unblocks Tasks 3 and 4 of fb-38:
- VectorRetriever::build_hit now labels hits with ScoreKind::Cosine
- Hybrid retriever test helpers (mk_hit functions) label synthetic hits with ScoreKind::Rrf
- Updated lexical snapshot fixture to reflect new score_kind field in output

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 17:54:16 +09:00
th-kim0823
3a621bba0d feat(search/lexical): label hits with ScoreKind::Bm25 (fb-38 task 2)
- Add ScoreKind::Bm25 to LexicalRetriever::build_hit SearchHit construction
- Import ScoreKind from kebab_core in lexical.rs
- Add integration test lexical_retriever_hits_carry_bm25_score_kind to verify all
  hits from LexicalRetriever carry score_kind == ScoreKind::Bm25
- Update lexical snapshot test baseline to include new score_kind field

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 17:54:11 +09:00
th-kim0823
3c605b1a5d feat(core): ScoreKind enum + SearchHit.score_kind (fb-38) 2026-05-10 17:49:02 +09:00
th-kim0823
56f20b7235 plan(fb-38): score semantics implementation plan
7 tasks: kebab-core ScoreKind enum + SearchHit field, lexical Bm25
labeling, vector Cosine, hybrid Rrf + search_with_trace pass-through,
cross-crate SearchHit literal cleanup, CLI integration test, docs
(wire schema + README + design + SKILL + INDEX).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 17:45:57 +09:00
th-kim0823
0359bd9682 spec(fb-38): score semantics design
- search_hit.v1 에 optional score_kind 필드 (rrf | bm25 | cosine)
- LexicalRetriever → Bm25, VectorRetriever → Cosine, HybridRetriever → Rrf
- fb-37 search_with_trace 의 mode-dispatch hits 는 underlying retriever 의
  score_kind 그대로 보존
- README + design §4 + SKILL 에 RRF 수식 전체 + "ranking signal, NOT confidence"
  안내, agent 용 trust threshold 는 nested retrieval.{lexical,vector}_score
- additive minor wire — schema bump 없음

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 17:40:47 +09:00
cf3acfc136 Merge pull request 'chore: bump version 0.4 → 0.5' (#130) from chore/bump-v0.5.0 into main
Reviewed-on: #130
2026-05-10 08:08:06 +00:00
th-kim0823
668e1174cc chore: bump version 0.4 → 0.5
v0.5.0 batches fb-32 (stale doc indicator) + fb-33 (streaming ask)
+ fb-34 (output budget controls) + fb-35 (verbatim fetch) + fb-36
(search filter args) + fb-37 (trace + stats).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 17:04:51 +09:00
745a75a82b Merge pull request 'feat(fb-37): trace + stats — search debug + KB health surface' (#129) from feat/fb-37-trace-and-stats into main
Reviewed-on: #129
2026-05-10 07:59:56 +00:00
th-kim0823
6a33d08aea fix(fb-37): address PR #129 round 1 review
- doc TraceFusionInput.fusion_score semantics (single-mode vs hybrid)
- comment why total_ms vs stage sum can drift (millis truncation)
- TODO marker on TUI trace popup filter passthrough

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 16:26:34 +09:00
th-kim0823
a40593590b docs(fb-37): wire schema + README + SMOKE + INDEX + SKILL 2026-05-10 14:13:47 +09:00
th-kim0823
5687cbc0e2 feat(tui): search pane t-key opens TracePopup (fb-37) 2026-05-10 13:39:11 +09:00
th-kim0823
653e432a30 feat(mcp): kebab__search trace input + output mirror (fb-37)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 13:32:30 +09:00
th-kim0823
f7e2072d66 test(cli): integration tests for --trace + schema breakdowns (fb-37)
Also fixes App::search_with_opts trace branch to use NoopRetriever
for SearchMode::Lexical, removing the embeddings requirement when
the user only wants lexical-mode trace.
2026-05-10 13:21:33 +09:00
th-kim0823
72c227af23 feat(cli): kebab search --trace flag + wire trace + pretty print (fb-37)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 13:08:48 +09:00
th-kim0823
69037c313a feat(app): SearchResponse.trace + opts.trace threading (fb-37)
Adds the `trace: Option<SearchTrace>` field to `SearchResponse` and
threads `SearchOpts.trace` through `App::search_with_opts`. When the
caller sets `opts.trace = true` the path bypasses the LRU search cache
and runs through `HybridRetriever::search_with_trace`, which dispatches
all 3 SearchModes internally; this means `--trace` requires embeddings
(same constraint as `--mode hybrid`). The non-trace path keeps its
exact prior behavior with `trace: None` stamped on the response.

Picked up Task 1 / Task 3 follow-ups in the same commit so the
workspace compiles: SearchOpts struct-literals in kebab-cli/main.rs +
kebab-mcp/tools/search.rs default the new `trace` field to false, and
the schema-wrapper test in kebab-cli/wire.rs fills the new
media_breakdown / lang_breakdown / index_bytes / stale_doc_count fields
on Stats with `Default::default()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 13:01:18 +09:00
th-kim0823
6a067e3ab1 feat(search): HybridRetriever::search_with_trace (fb-37) 2026-05-10 12:38:53 +09:00
th-kim0823
231d80e82d feat(stats): media/lang/bytes/stale fields on schema.v1.stats (fb-37)
Extends CountSummary with media_breakdown, lang_breakdown, stale_doc_count
fields populated via stats_ext::breakdowns(). Adds count_summary_with_threshold
for callers that need real stale counts. Mirrors all new fields onto the
wire-bound Stats struct in kebab-app::schema with #[serde(default)] for
backwards-compat. Also fixes search_budget_integration.rs for the trace field
added to SearchOpts in Task 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 12:34:57 +09:00
th-kim0823
69c6e23432 feat(store): breakdowns + index_bytes helpers (fb-37)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 12:24:43 +09:00
th-kim0823
1e943f21dc feat(core): SearchTrace + IndexBytes types + SearchOpts.trace (fb-37)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 12:17:04 +09:00
th-kim0823
fb31befef1 plan(fb-37): trace + stats implementation plan
10 tasks: kebab-core types, store breakdowns/index_bytes helpers,
extended CountSummary + Stats wire mirror, HybridRetriever
search_with_trace, App SearchResponse.trace threading, CLI --trace
flag, integration tests, MCP SearchInput.trace, TUI TracePopup,
docs (wire schema + README + SMOKE + INDEX + SKILL).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 12:14:26 +09:00
th-kim0823
5f6b2fa259 spec(fb-37): trace + stats design
- search --trace boolean flag, additive optional `trace` field on search_response.v1
- HybridRetriever search_with_trace returns (hits, SearchTrace) — lex/vec/rrf_inputs + per-stage timing
- cache bypass when --trace (debug intent)
- schema.v1.stats extended with media_breakdown / lang_breakdown / index_bytes / stale_doc_count
- TUI search pane `t` keystroke opens TracePopup
- additive minor wire — no schema bump

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 12:05:31 +09:00
a0497d9c53 Merge pull request 'chore: sync Cargo.lock for kebab-mcp time dep (fb-36)' (#128) from chore/sync-cargo-lock-fb36 into main
Reviewed-on: #128
2026-05-10 02:09:03 +00:00
th-kim0823
b221686133 chore: sync Cargo.lock for kebab-mcp time dep (fb-36)
PR #127 added time = { workspace = true } to kebab-mcp/Cargo.toml
but Cargo.lock entry was not regenerated before merge. cargo build
on main locally regenerates the +time line under kebab-mcp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 11:04:17 +09:00
a72c6f307c Merge pull request 'feat(fb-36): search filter args (--media / --ingested-after / --doc-id + 4 existing)' (#127) from feat/fb-36-search-filters into main
Reviewed-on: #127
2026-05-10 02:02:24 +00:00
th-kim0823
84287d0ef6 fix(fb-36): address PR #127 round 1 review
- ingested_after: convert OffsetDateTime to UTC before formatting
  so non-Z offsets compare correctly against UTC TEXT storage
  (lexical.rs + filters.rs)
- README: --tag is repeatable-only, not csv (only --media is csv)
- test(cli): add multi-value --tag OR-within IN-list coverage
- test(store): add UTC-offset regression test for ingested_after
- mcp: use ERROR_V1_ID const instead of hardcoded "error.v1"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 04:47:55 +09:00
th-kim0823
6e7446861b docs(fb-36): README + SMOKE + INDEX + skill notes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 04:26:27 +09:00
th-kim0823
b06f4654e7 feat(mcp): kebab__search filter inputs (fb-36)
7 new optional inputs on SearchInput: tags, lang, path_glob,
trust_min, media, ingested_after, doc_id. Validation surfaces as
error.v1 code = invalid_input via StructuredError. Dispatch builds
SearchFilters from the inputs and forwards through the existing
search_with_opts_with_config facade.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 04:11:27 +09:00
th-kim0823
4e0379c04f test(cli): wire_search_filters — lexical-only integration tests (fb-36)
Cover: --doc-id scoping, --ingested-after validation error,
--media md alias, --tag repeatable + frontmatter parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 04:06:21 +09:00
th-kim0823
6a18847892 feat(cli): kebab search filter flags (fb-36)
7 new flags: --tag (repeatable), --lang, --path-glob,
--trust-min (value_enum), --media (csv with `md` alias),
--ingested-after (RFC3339; config_invalid on parse fail),
--doc-id. Dispatch translates clap values into SearchFilters
and propagates structured errors through the existing
StructuredError wrapper from fb-34.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:57:55 +09:00
th-kim0823
c6cc1e2bfe feat(search/vector): media / ingested_after / doc_id filters (fb-36)
filter_chunks helper in kebab-store-sqlite extended with the same 3
WHERE clauses as lexical. Vector still over-fetches k*2 then
post-filters via SqliteStore::filter_chunks; small k can return < k
hits when filters drop a lot — agent is expected to widen k or
paginate. AND combinator with existing filters.

- kebab-store-sqlite/src/filters.rs: media IN-list subquery, ingested_after
  lexicographic >= compare, doc_id equality; mirrors lexical SQL arms
- 3 direct unit tests (filter_chunks_media_type/ingested_after/doc_id)
  that run without AVX/Lance
- common/mod.rs: insert_doc / insert_doc_with_media / run_vector_search
  helpers on HybridEnv for integration-test use
- hybrid.rs: 2 new #[ignore = "requires AVX..."] integration tests
  (vector_filter_by_media, vector_filter_by_doc_id)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:50:56 +09:00
th-kim0823
86475e5ba2 fix(search/lexical): use std::iter::repeat_n (clippy)
Per code review on 2c80e2a. manual-repeat-n lint triggers
for Rust 1.94+ when repeat().take() can be expressed as
repeat_n directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:43:51 +09:00
th-kim0823
2c80e2ad91 feat(search/lexical): media / ingested_after / doc_id filters (fb-36)
SQL WHERE clause extension. media uses CASE WHEN json_type='text'
to handle both unit (\`"markdown"\`) and tuple (\`{"image":"png"}\`)
MediaType serde shapes. ingested_after relies on RFC3339 lexicographic
ordering with UTC Z (per fb-32 ingest invariant). doc_id is a simple
equality. AND combinator with existing tags / lang / trust filters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:41:02 +09:00
th-kim0823
d3f38c76e9 feat(core): SearchFilters gains media / ingested_after / doc_id (fb-36)
3 additive optional fields. #[serde(default)] preserves
backwards compat for older JSON without the new keys.
MEDIA_KINDS const exposes canonical "markdown"/"pdf"/"image"/
"audio"/"other" labels for downstream alias normalization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:36:45 +09:00
th-kim0823
31c1e05951 plan(fb-36): search filter args implementation plan
9 tasks: SearchFilters extension, lexical SQL WHERE, vector
filter_chunks mirror, CLI 7 flags, integration tests, MCP
SearchInput extension, workspace test/clippy, docs, smoke+PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:34:39 +09:00
th-kim0823
7210386699 spec(fb-36): search filter args — design
`kebab search` 에 7 flag 노출 (기존 4 + 신규 3):
- --tag (반복) / --lang / --path-glob / --trust-min (기존 SearchFilters)
- --media (csv) / --ingested-after (RFC3339) / --doc-id (신규)

filter layer = SQLite WHERE (lexical) + over-fetch+post-filter
(vector). AND 결합. wire schema 무변경 (input only).

`SearchFilters` 3 필드 additive (#[serde(default)] 로 backwards-
compat). MCP SearchInput 7 optional 필드 추가. invalid RFC3339 →
error.v1.code = config_invalid.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:26:40 +09:00
a7115be699 Merge pull request 'feat(fb-35): verbatim fetch (chunk / doc / span)' (#126) from feat/fb-35-verbatim-fetch into main
Reviewed-on: #126
2026-05-09 16:09:48 +00:00
th-kim0823
b86b763dfb fix(fb-35): address PR #126 round 2 review
- wire schema: relax effective_end.minimum 1 → 0 + expand
  description to cover line-clamp + out-of-range sentinel
  (panic-fix R1 emits Some(0) when line_start=1 and range is
  beyond doc end — schema must accept it)
- tests: tighten first-chunk-target boundary test to assert ≤ 2
  total neighbors (3-chunk doc, N=2). Strict "first chunk →
  context_before empty" not assertable until chunks.ordinal
  column lands (R1 #9 architectural caveat)
- store: trim contradiction in list_chunk_ids_for_doc warning
  comment — drop "good enough for sequentially chunked
  markdown" phrase that conflicts with "hash sort dominates"
  paragraph above

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:55:29 +09:00
th-kim0823
7dddc1d706 fix(fb-35): address PR #126 round 1 review
- fetch_span: panic-fix on line_start > total / empty doc
  (return empty text + effective_end = line_start - 1 instead of
  out-of-bounds slice)
- truncated: reserved for budget-driven truncation only; line
  range clamp signaled via effective_end < line_end
- spec / SKILL.md / README: align rejection wording to "PDF /
  audio" (matches code; Image OCR allowed for span)
- store: warning comment on list_chunk_ids_for_doc — chunk_id
  hash sort does NOT preserve document position; real fix is a
  chunks.ordinal column, tracked as follow-up
- surrounding_chunks: saturating_add to defend against u32::MAX
  context arg on 32-bit targets
- tests: line_start > total returns empty + chunk context at
  doc boundary clamps lower bound

Deferred nits (follow-up): table-separator strict CommonMark form;
MCP per-mode strict validation; CLI chunk_id truncation in plain
output. None block correctness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:45:29 +09:00
th-kim0823
2a6b3dc7e6 docs(fb-35): README + SMOKE + INDEX + skill notes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:21:35 +09:00
th-kim0823
8d8f1c0294 test(cli): bump expected MCP tool count 6 → 7 for fb-35 fetch
cli_mcp_initialize_then_tools_list asserts the exact tools[]
count returned by tools/list. fb-35 added kebab__fetch as the
7th tool — bump the assertion accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:20:59 +09:00
th-kim0823
77bf19566c feat(mcp): kebab__fetch tool — chunk / doc / span (fb-35)
Mirrors CLI surface: same input shape, same fetch_result.v1
output. invalid_input error for missing kind-specific fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:11:37 +09:00
th-kim0823
beb40249a3 test(cli): wire_fetch — chunk/doc + chunk_not_found integration (fb-35)
3 lexical-only integration tests: chunk JSON shape, doc truncated
with --max-tokens, unknown chunk_id returns error.v1 with
code = chunk_not_found.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:06:14 +09:00
th-kim0823
0fffd69071 feat(cli): kebab fetch chunk / doc / span (fb-35)
JSON output is fetch_result.v1; plain output is human-friendly
labeled sections (chunk: before / target / after; doc/span: full
text + stderr truncated hint).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:01:56 +09:00
th-kim0823
1b9d89eb3a feat(app): App::fetch span mode + PDF/audio rejection (fb-35)
Line-based slice over fmt_canonical_to_markdown output.
PDF / audio source_type → span_not_supported StructuredError.
Out-of-range line_end clamps to total; effective_end reflects
post-budget trim. invalid_input on zero / inverted bounds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:54:22 +09:00
th-kim0823
7d1f855f7e feat(app): App::fetch doc mode with budget (fb-35)
Walks CanonicalDocument blocks, serializes to markdown, applies
chars/4 budget when opts.max_tokens is set. doc_not_found
preserved through StructuredError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:48:40 +09:00
th-kim0823
610d29f053 feat(app): App::fetch chunk mode + markdown serializer (fb-35)
Chunk mode + +-N context. doc / span modes return placeholder
errors (filled by subsequent tasks). fmt_canonical_to_markdown
helper introduced now since doc mode (Task 4) consumes it.
Errors are typed StructuredError so classify preserves
chunk_not_found / doc_not_found through the wire layer.

Adds SqliteStore::list_chunk_ids_for_doc so the facade can derive
+-N neighbors without leaking direct rusqlite usage into kebab-app.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:44:51 +09:00
th-kim0823
75eeae3933 feat(wire): fetch_result.v1 schema (fb-35)
Discriminated by kind (chunk / doc / span). Per-kind required
fields enforced by description prose at v1 stub stage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:36:19 +09:00
th-kim0823
9653592c16 feat(core): FetchQuery / FetchOpts / FetchResult / FetchKind (fb-35)
Domain types for `kebab fetch` 3 modes (chunk / doc / span). All
types Serialize so wire layers hand them through serde_json
directly. FetchKind is snake_case-renamed to match the wire
discriminator literal in fetch_result.v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:35:21 +09:00
th-kim0823
353aa5cc78 plan(fb-35): verbatim fetch implementation plan
11 tasks: domain types, wire schema, App::fetch chunk/doc/span
modes (3 separate tasks for incremental TDD), CLI subcommand,
CLI integration tests, MCP tool, workspace+clippy gate, docs,
smoke+PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:31:29 +09:00
th-kim0823
4eda9c317d spec(fb-35): verbatim fetch — design
`kebab fetch chunk|doc|span` 신규 subcommand + MCP `kebab__fetch`
tool. wire = `fetch_result.v1` (kind discriminator).

source = CanonicalDocument / chunks.text 정규화된 markdown (raw
bytes 미노출). chunk mode `--context N` = ordinal ±N. doc/span
mode = fb-34 budget 재사용 (chars/4). PDF/audio span 은
`error.v1.code = span_not_supported` 거절.

신규 error codes: chunk_not_found / doc_not_found /
span_not_supported / invalid_input. fb-34 StructuredError
wrapper 재사용.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:21:01 +09:00
9817a3de59 Merge pull request 'feat(fb-34): output budget controls' (#125) from feat/fb-34-output-budget-controls into main
Reviewed-on: #125
2026-05-09 12:52:36 +00:00
th-kim0823
e084b306e5 fix(fb-34): align next_cursor semantics with docs (PR #125 round 2)
Previous round-1 fix dropped the speculative cursor branch on
the truncated path, leaving a contradiction with the docs:
- snippet-only shrunk → cursor emitted (returned == k_effective)
- k-popped → cursor null (returned < k_effective)
But docs promised the opposite.

R2 resolution: emit cursor whenever more hits may be reachable
(either retriever filled the page OR budget popped hits — the
popped ones remain fetchable from offset+returned). Drop the
artificial "widen vs paginate" copy; truncated and next_cursor
are now independent signals — caller may do either or both.

Updates: app.rs::search_with_opts logic + SearchResponse doc +
schema description + SKILL.md two bullets + max_tokens=0 test
asserts cursor IS emitted on k-pop case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 21:07:04 +09:00
th-kim0823
f485608108 fix(fb-34): address PR #125 round 1 review
- error_wire: StructuredError wrapper preserves ErrorV1 through
  anyhow → classify pipeline. Adds downcast short-circuit so
  cursor::decode's typed code = "stale_cursor" reaches the wire
  instead of being string-formatted to code = "generic".
- app: search_with_opts now wraps cursor::decode error in
  StructuredError instead of anyhow! string format.
- test: error_wire pins both negative (bare anyhow → not
  stale_cursor) AND positive (StructuredError → stale_cursor)
  invariants. CLI integration test runs end-to-end and asserts
  error.v1.code on stderr.
- app: next_cursor only emitted on full-page (k-pop) path; drop
  speculative emit on snippet-only truncation that would point at
  a different page than the agent expected.
- cursor: differentiate malformed-base64 / malformed-payload /
  revision-mismatch error messages; all keep code = stale_cursor.
- test: cursor_rejected fixture uses .expect() to fail loud on
  cursor non-emission instead of silent skip.
- test: max_tokens=0 → 1-hit floor + truncated=true.
- docs: SKILL.md + schema description distinguish snippet-shrink
  (widen) vs k-pop (paginate) truncated cases. HOTFIXES notes
  --no-cache semantic shift (cached path + clear vs uncached path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 20:49:27 +09:00
th-kim0823
9f076003e2 docs(fb-34): README + SMOKE + INDEX + HOTFIXES + skill notes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 20:20:58 +09:00
th-kim0823
e1fcea6313 chore: clippy fix for fb-34 — allow result_large_err on cursor::decode
ErrorV1 is the workspace wire error struct; boxing here would
force every call site to deref through a Box for no win — the
err-path is rare. Single allow at the function level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 20:20:36 +09:00
th-kim0823
5e0cff1b92 feat(mcp): search tool emits search_response.v1 + budget inputs (fb-34)
SearchInput gains max_tokens / snippet_chars / cursor (all optional).
Output wrapped in search_response.v1 to match CLI; existing
tools_call_search test updated to read v["hits"] instead of the bare
array.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 20:12:05 +09:00
th-kim0823
603061fb86 test(cli): wire_search_response + budget integration (fb-34)
4 lexical-only tests covering search_response.v1 wrapper shape,
--max-tokens truncation, --cursor pagination, plain stderr hint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 20:09:01 +09:00
th-kim0823
21220f6d39 feat(cli): kebab search --max-tokens / --snippet-chars / --cursor (fb-34)
JSON output wrapped in search_response.v1 (breaking — agent must
adapt). Plain output unchanged + [truncated; use --cursor X]
stderr hint when budget tripped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 20:02:50 +09:00
th-kim0823
f25ad31741 feat(wire): search_response.v1 schema (fb-34)
Wrapper around search_hit.v1[] with next_cursor + truncated.
Wire breaking — agent that parses bare array must adapt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 18:00:58 +09:00
th-kim0823
af80cedd81 feat(app): App::search_with_opts + SearchResponse (fb-34)
Budget loop: snippet shorten → k pop → ≥1 hit floor. Cursor
encode/decode threads corpus_revision; mismatch surfaces as
stale_cursor anyhow error. App::search retained as thin wrapper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:59:48 +09:00
th-kim0823
aabe66f5e2 docs(error_wire): note stale_cursor convention (fb-34)
stale_cursor is built by cursor::decode, not classify. Test
locks the invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:50:39 +09:00
th-kim0823
ebbc3a46ae feat(app): cursor encode/decode for paginated search (fb-34)
Opaque base64(JSON{offset, corpus_revision}). Mismatch or
malformed input returns ErrorV1 with code = stale_cursor.
base64 promoted to workspace dep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:49:23 +09:00
th-kim0823
e00418537f feat(core): SearchOpts domain type for budget controls (fb-34)
3 optional knobs (max_tokens, snippet_chars, cursor); Default = all
None = no enforcement (backwards-compat existing search behavior).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:46:40 +09:00
th-kim0823
dbb7b54d5d plan(fb-34): output budget controls implementation plan
11 tasks: SearchOpts (kebab-core), cursor module + base64 dep
(kebab-app), error_wire stale_cursor convention, App::search_with_opts
+ SearchResponse + budget loop, wire schema search_response.v1, CLI
flags + plain truncated hint, CLI integration tests, MCP wrapper +
inputs, workspace+clippy gate, docs (README/SMOKE/INDEX/HOTFIXES/
skill), smoke+PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:43:26 +09:00
th-kim0823
a80f65c6f2 spec(fb-34): output budget controls — design
`kebab search` 에 --max-tokens / --snippet-chars / --cursor 신규.
chars/4 token approximation. truncate priority: snippet → k → 멈춤
(최소 1 hit 보장). cursor = opaque base64(offset + corpus_revision)
— mismatch 시 error.v1.code = stale_cursor.

wire breaking: stdout array → search_response.v1 wrapper. agent 갱신
필요. App::search 시그니처는 thin wrapper 로 보존 (TUI 무영향).

ask path 는 scope out (rag.max_context_tokens 가 이미 budget 담당).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:36:51 +09:00
a9ff122ab2 Merge pull request 'feat(fb-33): streaming ask (ndjson delta)' (#124) from feat/fb-33-streaming-ask into main
Reviewed-on: #124
2026-05-09 07:33:29 +00:00
th-kim0823
225831ffcd fix(fb-33): correct HOTFIXES cross-reference per PR #124 round 2
Pointed at the actual fb-33 design spec path + clarified that
the AskOpts type widening is a byproduct of the new wire schema
forcing single-sink 3-stage transport, not a stand-alone breaking
change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:52:51 +09:00
th-kim0823
a082b78f8e fix(fb-33): address PR #124 round 1 review
- pipeline: refresh module docstring step 5 to reflect new cancel
  semantics (RetrievalDone/Token/Final + LlmStreamAborted)
- wire schema: spell out refusal-path behavior in answer_event.v1
  description (only retrieval_done emitted; no final)
- test: factual comment on relax_score_gate-using test corrected
- test: new Ollama-gated stream_score_gate_refusal_emits_only_retrieval_done
- test: new ask_emits_no_final_when_cancelled_mid_stream pinning
  the no-Final invariant on cancel
- pipeline: large_enum_variant comment broadened to acknowledge
  RetrievalDone.hits as the dominant per-emit cost
- HOTFIXES: log AskOpts.stream_sink internal API break per spec
  contract policy

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:46:04 +09:00
th-kim0823
e1c6b7055a docs(fb-33): README + SMOKE + INDEX + skill notes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:22:09 +09:00
th-kim0823
39bf0de949 test(cli): wire_ask_stream — stderr ndjson + stdout final + BrokenPipe cancel (fb-33)
Three Ollama-gated integration tests covering:
- stderr lines parse as answer_event.v1 (retrieval_done first,
  final last, all carry RFC3339 ts).
- stdout final line is answer.v1 (backwards compat).
- non-stream path (--json without --stream) unchanged.
- BrokenPipe stderr → child terminates cleanly via cancel
  propagation through pipeline SendError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:14:00 +09:00
th-kim0823
29629e6786 feat(cli): kebab ask --stream emits ndjson on stderr (fb-33)
Background-thread driver runs ask_with_config; main thread
drains the receiver, serializes each StreamEvent to ndjson on
stderr. BrokenPipe → drop receiver → pipeline SendError →
cancel + LlmStreamAborted refusal. Final stdout line is the
existing answer.v1 (ingest_progress.v1 backwards-compat
pattern).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:03:41 +09:00
th-kim0823
e8caf2a57e feat(wire): answer_event.v1 schema (fb-33)
Discriminated ndjson event for `kebab ask --stream`. Mirrors
the ingest_progress.v1 pattern (stderr stream + stdout final
answer.v1 for backwards compat).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 14:58:49 +09:00
th-kim0823
e5c99f5b80 feat(tui): adapt ask worker to StreamEvent sink (fb-33)
Worker channel now carries kebab_app::StreamEvent. drain_stream
matches on Token { delta }; RetrievalDone and Final are ignored
(citations render from last_answer, Final is redundant with
worker join). app::AskState.rx type widened to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 14:57:46 +09:00
th-kim0823
307fd8d527 feat(rag): pipeline emits StreamEvent + cancel on SendError (fb-33)
RetrievalDone after retrieve+stale-stamp, Token per LM chunk
(SendError → break, FinishReason::Cancelled, RefusalReason::
LlmStreamAborted), Final on success. answers row still persists
on cancel for audit. Adds FinishReason::Cancelled, re-exports
StreamEvent from kebab_rag, migrates two pre-fb-33 sink tests
in tests/pipeline.rs to the new StreamEvent type (the
"dropped receiver does not abort" test inverts to record cancel).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 14:49:55 +09:00
th-kim0823
31475f0312 feat(rag): StreamEvent enum + switch AskOpts.stream_sink (fb-33)
3-variant discriminated enum (RetrievalDone / Token / Final).
AskOpts.stream_sink now carries StreamEvent. Other crates fail
to compile until subsequent tasks adapt their call sites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 14:38:54 +09:00
th-kim0823
0ca9b1d5c3 plan(fb-33): streaming ask implementation plan
10 tasks: StreamEvent enum + AskOpts switch (kebab-core), pipeline
emits + cancel branch (kebab-rag), kebab-app re-exports, TUI
worker adapt, wire schema answer_event.v1, CLI --stream flag +
ndjson stderr driver + BrokenPipe cancel, integration tests
(Ollama-gated), workspace+clippy gate, docs, smoke+PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 14:16:42 +09:00
th-kim0823
4949775c8b spec(fb-33): streaming ask (ndjson delta) — design
3-variant StreamEvent enum (RetrievalDone / Token / Final) 을 통해
RagPipeline 이 retrieval / per-token / final 단계를 sink 로 발사.
CLI `kebab ask --stream` 이 ndjson event 를 stderr 로 흘리고 final
stdout line 은 기존 answer.v1 그대로 (ingest_progress.v1 패턴).
Cancel = stdout 닫힘 → SendError → LLM stream break +
RefusalReason::LlmStreamAborted 로 partial answer 기록.
MCP streaming 은 v0.5+ 별도 검토 (scope out).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 14:10:08 +09:00
877ad18f34 Merge pull request 'chore: bump version 0.3.3 → 0.4.0' (#123) from chore/bump-v0.4.0 into main
Reviewed-on: #123
2026-05-09 03:35:20 +00:00
th-kim0823
df42d8f621 chore: bump version 0.3.3 → 0.4.0
fb-32 머지로 wire schema 가 search_hit.v1 / citation.v1 의 required
필드를 두 개 (indexed_at, stale) 확장 — additive minor 로 분류했지만
strict validator 입장에서는 한 번 깨진 셈이라 minor bump.

surface 변경 (사용자 도그푸딩 영향):
- 모든 search hit / RAG citation 의 wire JSON 에 indexed_at (RFC3339) +
  stale (bool) 두 필드 추가
- CLI plain 출력 — stale doc 의 doc_path 옆에 [stale] tag (TTY = 노란색)
- TUI Search/Inspect/Ask pane — stale doc 의 doc_path 좌측에 [STALE] 배지
  (Theme::Warning role)
- config.toml [search] stale_threshold_days 신규 (default 30, 0 = 비활성)
- env KEBAB_SEARCH_STALE_THRESHOLD_DAYS

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:32:37 +09:00
6a01f15261 Merge pull request 'feat(fb-32): per-hit + per-citation freshness indicators' (#122) from feat/fb-32-stale-doc-indicator into main
Reviewed-on: #122
2026-05-09 03:24:59 +00:00
th-kim0823
cb04bd8c8d fix(fb-32): address PR #122 round 2 review
- spec: add one-line cross-link to HOTFIXES entry per CLAUDE.md
  Spec-contract policy
- HOTFIXES: rename heading from "fb-32" to "p9-fb-32" matching
  the rest of the file's full-ID convention
- config: defensive assert before string-replace in negative TOML
  test guards against default-value drift causing unhelpful unwrap

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:19:19 +09:00
th-kim0823
efc6b7ebb0 fix(fb-32): address PR #122 round 1 review
- config: rename env-silent-ignore test + add file-load negative test
  asserting ConfigInvalid for negative TOML stale_threshold_days
- rag: add 5 boundary unit tests pinning compute_stale mirror equivalence
- search: rewrite "Task 6" plan refs in lexical/vector to point at
  actual function names (mark_stale_in_place / RagPipeline::ask)
- cli: dedupe write_config / ingest / backdate_updated_at helpers
  from wire_search_stale + wire_ask_stale into tests/common/mod.rs
- tui: clarify inspect.rs uses same source-of-truth as SearchHit
- rag: PackedCitation.stale invariant doc comment
- HOTFIXES: log conscious decision on wire-schema required-field
  expansion (strict-validator concern)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:04:28 +09:00
th-kim0823
1008bca342 docs(fb-32): README + SMOKE + INDEX + skill parsing tip
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 02:57:14 +09:00
th-kim0823
1f39b6bc2c feat(tui): [STALE] Warning-styled badge on search/inspect/ask (fb-32)
insta filter pattern '[indexed_at]' applied where snapshots
otherwise capture time-dependent RFC3339 strings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 02:43:05 +09:00
th-kim0823
aeee7ed771 feat(cli): [stale] tag on plain ask citations (fb-32)
Mirror of Task 9's search-output rendering: yellow [stale] on TTY,
plain text otherwise. JSON path inherits via serde on AnswerCitation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 02:34:58 +09:00
th-kim0823
15cdc97cae feat(cli): [stale] tag on plain search output (fb-32)
Yellow when TTY, plain when not. JSON path inherits via serde
on the domain type; no CLI-side wire change needed there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 02:24:54 +09:00
th-kim0823
cc41adabb5 feat(wire): search_hit.v1 + citation.v1 require indexed_at + stale (fb-32)
Additive minor — schema_version unchanged. Existing v1 consumers
that ignore unknown fields stay compatible; consumers that validate
strictly will reject pre-fb-32 payloads, which matches the wire
contract escape hatch (recipient version >= producer required).

Cross-task placeholders: kebab-eval / kebab-tui synthetic test
fixtures pin UNIX_EPOCH + stale=false (same pattern as
hybrid.rs / vector.rs). These don't exercise staleness — Task 11
adds dedicated TUI staleness rendering tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 02:17:15 +09:00
th-kim0823
16db60f7bd docs(fb-32): mirror back-reference + fix pipeline doc-comment ordering
- kebab-app::staleness::compute_stale gains note pointing at the
  kebab-rag mirror so future modifiers know to update both copies.
- kebab-rag::pipeline: doc comments adjacent to compute_stale and
  embedding_ref_for were positioned such that rustdoc would
  misattribute them. Reorder/separate so each comment hugs its
  own function.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:51:40 +09:00
th-kim0823
e398272a24 feat(rag): AnswerCitation inherits indexed_at + stale from hit (fb-32)
pack_context widened to carry indexed_at + stale alongside marker
and Citation. LLM-citation construction site now plumbs real values
from upstream SearchHit instead of the Task 6 UNIX_EPOCH placeholder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:44:24 +09:00
th-kim0823
e891e487cf test(rag): mk_hit gains indexed_at + stale stubs (fb-32)
Test helper missed the SearchHit field expansion from fb-32 Task 1.
UNIX_EPOCH + false placeholders consistent with the cross-crate
synthetic-mock pattern (hybrid.rs, vector.rs build_hit Task 4 stub).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:37:19 +09:00
th-kim0823
dfef65f196 feat(app): staleness module + post-process search hits (fb-32)
compute_stale: strict > boundary, threshold=0 disables, future
timestamps treated as fresh (clock skew safety). App::search
re-stamps on cache hit so config threshold changes take effect
without flushing the cache.

Also unblocks the workspace build by plugging placeholder
indexed_at/stale into the two AnswerCitation construction
sites in kebab-rag/pipeline.rs (the score-gate refusal path
forwards from SearchHit; the LLM-citation path uses
UNIX_EPOCH/false until Task 7 wires the real values through
pack_context).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:30:10 +09:00
th-kim0823
8faad2f407 feat(search/vector): populate SearchHit.indexed_at (fb-32)
hydrate_chunks now JOINs d.updated_at. Hybrid fusion path is
unchanged (passes SearchHit through, fields preserved).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:17:54 +09:00
th-kim0823
f4ce6652b2 feat(search/lexical): populate SearchHit.indexed_at (fb-32)
JOIN documents.updated_at. stale defaults to false; App facade
post-processes against config threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:10:20 +09:00
th-kim0823
922849cd95 feat(config): search.stale_threshold_days (fb-32)
default 30 days. env override KEBAB_SEARCH_STALE_THRESHOLD_DAYS.
Malformed env values are silently ignored, matching the existing
apply_env pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:01:01 +09:00
th-kim0823
3a7a28e682 feat(core): AnswerCitation gains indexed_at + stale (fb-32)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 00:56:36 +09:00
th-kim0823
8b0f64db6b feat(core): SearchHit gains indexed_at + stale (fb-32)
Domain field additions for p9-fb-32. Wire serialization is
automatic via serde rfc3339. Other crates fail to compile until
they populate the new fields — fixed in subsequent tasks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 00:52:46 +09:00
th-kim0823
4728a87957 plan(fb-32): stale doc indicator implementation plan
15 tasks covering domain (kebab-core SearchHit + AnswerCitation),
config (SearchCfg.stale_threshold_days), retrievers (lexical + vector
JOIN documents.updated_at), App facade (staleness module + cache
re-stamp), wire schema, CLI plain [stale] tag, TUI [STALE] Warning
badge, snapshot fan-out, docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 00:40:46 +09:00
th-kim0823
401a47fb43 spec(fb-32): stale doc indicator — design
검색 hit / RAG citation 에 indexed_at + stale 두 wire 필드 추가.
documents.updated_at 재활용 (V006 incremental ingest 가 자연 source-of-truth).
config [search] stale_threshold_days = 30 default. additive minor wire.
TUI Warning role / CLI plain [stale] tag / agent --json 동시 surface.
자동 재 ingest 는 out of scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 18:00:10 +09:00
6d4a648349 Merge pull request 'docs(skill): MCP-first guidance + CLI as fallback' (#121) from docs/skill-mcp-first into main
Reviewed-on: #121
2026-05-07 14:46:40 +00:00
th-kim0823
b20c1dd56a docs(skill): MCP-first guidance + CLI as fallback
Repo-shipped skill previously framed CLI as primary surface with MCP
as a separate "recommended" alternative. Recipes still used CLI calls
even though MCP server has been the recommended surface since v0.3.1.

Now: MCP tool catalog (6 tools as `mcp__kebab__<name>`) leads, recipes
call MCP tools directly, CLI documented as fallback for hosts without
MCP. Generic wording preserved (no user-specific trigger keywords).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 23:45:25 +09:00
834a1e1723 Merge pull request 'fix(progress): one draw per file — drop set_message in TTY AssetStarted' (#120) from fix/progress-single-draw into main
Reviewed-on: #120
2026-05-07 13:29:56 +00:00
th-kim0823
3328760dca fix(progress): one draw per file — drop set_message in TTY AssetStarted
set_draw_target switching broke cursor positioning: each hidden→stderr
restore caused indicatif to draw a fresh line instead of overwriting.
Root fix: call only set_position() in TTY AssetStarted (one draw per
file). Filename visible in non-TTY plain-line output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 22:28:37 +09:00
f25e16f80c Merge pull request 'docs(tasks): mark fb-30 + fb-31 merged in INDEX.md' (#115) from docs/index-fb30-fb31-status into main
Reviewed-on: #115
2026-05-07 13:20:32 +00:00
4475abbf4f Merge pull request 'fix(progress): eliminate duplicate TTY frame per asset' (#119) from fix/progress-duplicate-tty-frame into main
Reviewed-on: #119
2026-05-07 13:17:34 +00:00
th-kim0823
5be90cffec fix(progress): eliminate duplicate TTY frame per asset
set_position() and set_message() each call update_and_draw()
independently, producing two scrollback lines per file in TTY mode.
Suppress the draw target before the two updates, restore to stderr,
then call tick() to emit exactly one frame.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 22:15:01 +09:00
6f0b2bcc37 Merge pull request 'fix(config): use XDG-standard paths on macOS (prevent DataOnly reset deleting config)' (#118) from fix/xdg-macos-path-collision into main
Reviewed-on: #118
2026-05-07 13:00:59 +00:00
th-kim0823
36fe7416c8 fix(config): use XDG-standard paths on macOS to prevent DataOnly reset deleting config
dirs::config_dir() and dirs::data_dir() both return ~/Library/Application Support
on macOS, so data_dir == config parent dir. ResetScope::DataOnly removes data_dir
and silently deletes config.toml along with it.

Fix: bypass dirs crate fallback for config/data/cache dirs; use
$HOME/.config, $HOME/.local/share, $HOME/.cache directly (XDG standard).
xdg_state_dir already used this pattern. dirs::home_dir() still used for
portability.

Migration: Config::load(None) auto-copies legacy ~/Library/Application
Support/kebab/config.toml to the new ~/.config/kebab/ on first run and
prints a migration notice to stderr.
2026-05-07 21:59:49 +09:00
d6e2e6273e Merge pull request 'fix(progress): eliminate duplicate bar frame per asset in TTY mode' (#116) from fix/progress-duplicate-tty-frame into main
Reviewed-on: #116
2026-05-07 12:50:31 +00:00
th-kim0823
cb266e0071 fix(progress): eliminate duplicate bar frame per asset in TTY mode
AssetStarted now advances position (idx-1) and sets message together.
AssetFinished no longer updates the bar — Completed handles final
cleanup via finish_and_clear. Result: one bar frame per file instead
of two, eliminating the scrollback duplicate-line artifact.
2026-05-07 21:49:47 +09:00
th-kim0823
ee15528acf docs(tasks): mark fb-30 + fb-31 merged in INDEX.md 2026-05-07 21:26:25 +09:00
e03b754a16 Merge pull request 'chore: bump version 0.3.2 → 0.3.3' (#114) from chore/bump-v0.3.3 into main
Reviewed-on: #114
2026-05-07 12:22:22 +00:00
th-kim0823
6b13d8e11f chore: bump version 0.3.2 → 0.3.3 2026-05-07 21:10:23 +09:00
fea91d5c99 Merge pull request 'feat(fb-26,fb-28): ingest log consistency + --readonly/--quiet flags + schema sync' (#113) from feat/p9-fb-26-fb-28-agent-ux into main
Reviewed-on: #113
2026-05-07 12:09:04 +00:00
th-kim0823
0e762e6374 fix: rename leftover kbkebab in main.rs comments 2026-05-07 20:52:34 +09:00
th-kim0823
b230fbb495 fix: apply review nits — kb→kebab comment, quiet reset guard, ingest-stdin readonly test, README+SMOKE docs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 19:58:56 +09:00
th-kim0823
afbd64dafc docs: mark fb-26 + fb-28 merged, HOTFIXES entries for progress bugs + readonly_mode
- fb-26 (progress.rs): Fixed Aborted unconditional writeln (TTY duplicate output)
  and Completed TTY path missing summary line. Added KEBAB_PROGRESS=plain env
  override and quiet field to ProgressMode.
- fb-28 (main.rs): Added --readonly / --quiet global flags with KEBAB_READONLY env.
  Readonly blocks mutating commands (ingest/ingest-file/ingest-stdin/reset) with
  exit code 1; error.v1 code "readonly_mode" in --json mode. Quiet suppresses all
  human progress/hint stderr while preserving errors.
- Updated task spec status for p9-fb-26 and p9-fb-28 to 'merged'.
- Updated tasks/INDEX.md and HANDOFF.md with merge status and summary entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 19:45:40 +09:00
th-kim0823
6bedba4a7f test(fb-26,fb-28): integration tests for readonly/quiet flags and KEBAB_PROGRESS=plain
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 19:43:04 +09:00
th-kim0823
fd4125c0a0 feat(fb-28): --readonly/--quiet global flags + KEBAB_READONLY env + is_mutating guard
Add readonly/quiet fields to Cli, parse_bool_env for 1/true/yes/on support,
is_mutating guard that short-circuits with error.v1 on write-path commands,
and wire KEBAB_PROGRESS=plain through from_flags in the Ingest arm.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 19:38:30 +09:00
th-kim0823
4191347491 fix(fb-26): Completed TTY missing summary + Aborted unconditional writeln + quiet suppression in handle_human
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 19:33:57 +09:00
th-kim0823
dd33902f5a feat(fb-26): extend ProgressMode with quiet field, update from_flags signature
Add `quiet: bool` to `Human` variant and expand `from_flags` to three
args (`json`, `quiet`, `plain_env`). Update `handle`/`handle_human`
accordingly; add four targeted unit tests (TDD).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 19:31:01 +09:00
th-kim0823
c8a8bc9045 docs: sync wire schema list in CLAUDE.md (remove phantom eval_run/eval_compare/list_docs, add chunk_inspection/citation/doc_summary) 2026-05-07 19:28:05 +09:00
th-kim0823
2de28c43da docs(plan): fb-26 + fb-28 + schema-sync implementation plan
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 19:18:27 +09:00
th-kim0823
9d96504bd9 docs(spec): fb-26 + fb-28 + schema-sync design doc
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 19:11:47 +09:00
6dcc5ce412 Merge pull request 'chore: bump version 0.3.1 → 0.3.2' (#112) from chore/bump-v0.3.2 into main
Reviewed-on: #112
2026-05-07 10:01:08 +00:00
th-kim0823
751377cae8 chore: bump version 0.3.1 → 0.3.2
fb-31 single-file/stdin ingest + MCP ingest tools + docs/mcp-usage.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 18:59:03 +09:00
4922f9fc64 Merge pull request 'feat(fb-31): single-file / stdin ingest — agent on-demand 저장' (#111) from feat/p9-fb-31-single-file-stdin-ingest into main
Reviewed-on: #111
2026-05-07 09:56:09 +00:00
th-kim0823
47bfd518c8 📝 docs: comprehensive MCP usage guide (fb-31)
신규 docs/mcp-usage.md (~280 line) — agent integration 의 종합 가이드:

- Quick start + `--config` thread 예시
- Host config 예시 (Claude Code / Cursor / OpenAI Agents / Copilot CLI)
- 6 tool catalog (search / ask / schema / doctor / ingest_file / ingest_stdin)
  각 tool 의 input shape, defaults, output 예시, "언제 사용", mutation
  주의사항.
- Troubleshooting — error.v1 의 7 code 별 조치 표 + grounded:false +
  doctor !ok + empty search + tool-not-found 시나리오.
- Multi-turn ask + session 관리 — session_id 명명, 새 session 시작
  시점, lifetime, single-shot vs session 비교.
- Performance / Security 절.

README.md 의 기존 MCP 절은 quick start 만 유지하고 docs/mcp-usage.md
링크. integrations/claude-code/kebab/SKILL.md 도 동일 cross-link.

agent 사용자 도그푸딩 후속 의견 — host-agnostic 가이드 + 명시적
troubleshooting 표 + multi-turn session 명명 컨벤션 부재 해소.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:53:59 +09:00
th-kim0823
7f5739d8fb 🏗️ refactor(fb-31): apply round 1 review nits
- ingest_file_with_config: lowercase normalize ext (caller-side) +
  early error on unsupported extension (`.docx` etc. now Err with
  helpful message instead of silent skipped_by_extension counter).
  New test ingest_file_errors_on_unsupported_extension.
- ingest_stdin_with_config: doc comment explaining intentional
  double-call of ensure helpers (idempotent + ~ms negligible).
- external::inject_frontmatter: simplify precheck via single
  trim_start binding + add CR-only line ending edge case.
- external::inject_frontmatter: doc note on yaml_quote escape
  contract (agent-supplied titles with special chars are safe).

Round 1 review summary: #111 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:46:55 +09:00
th-kim0823
dc24cb34b1 🚑 fix(fb-31): apply final review nits
- kebab-app: add #[doc(hidden)] to ingest_stdin_with_config (CLAUDE.md
  convention — all *_with_config functions should have this attribute;
  fb-31's first impl missed it on the second facade fn).
- SKILL.md: "Since v0.4.0" → "Since v0.3.1" (MCP shipped in fb-30
  release v0.3.1; the wrong version claim was introduced in fb-30 doc
  sync and carried forward into fb-31).
- tools_call_ingest_file: add idempotency test (second call with same
  content → unchanged=1, new=0). Spec called for two tests; first impl
  shipped only the happy path.

Version bump 0.3.1 → 0.3.2 deferred to separate `chore/bump-v0.3.2` PR
mirroring fb-27 + fb-30 precedent (commits 73f5d73 / 5495d96).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:38:19 +09:00
th-kim0823
ccee30037d 🧪 test(kebab-cli): update cli_mcp_smoke tools/list assertion 4 → 6 (fb-31)
fb-31 added ingest_file + ingest_stdin MCP tools (Task 9) but the
spawn-based smoke test in cli_mcp_smoke.rs still asserted the fb-30
count of 4. Bump to 6 to match the live tools/list response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:26:51 +09:00
th-kim0823
e041173e8e 📝 docs(tasks): HOTFIXES entry + p9-fb-31 status → completed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:18:50 +09:00
th-kim0823
345a4f363a 📝 docs: sync README / HANDOFF / CLAUDE / skill / design for fb-31
- README 명령 표 에 `kebab ingest-file` + `kebab ingest-stdin` 두 row + MCP tool list 4 → 6.
- HANDOFF post-도그푸딩 항목 한 줄.
- CLAUDE.md `_external/` 디렉토리 + naming convention 한 줄.
- integrations skill — Recipe D (agent fetched web doc) + MCP tool list 갱신.
- design §6.7 `_external/` subdirectory 절 신설.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:16:45 +09:00
th-kim0823
71c2bbdc97 🧪 test(kebab-mcp): ingest_file + ingest_stdin integration (fb-31)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:14:07 +09:00
th-kim0823
ecd77290cd feat(kebab-mcp): ingest_file + ingest_stdin tools (fb-31)
5th + 6th MCP tools — first mutation surface (fb-30 v1 was read-only).
Both wrap the new kebab-app facade fns + use spawn_blocking via the
existing spawn_tool helper. tools/list now returns 6 tools.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:12:18 +09:00
th-kim0823
fbc01eda50 🧪 test(kebab-cli): cli_ingest_file + cli_ingest_stdin integration (fb-31)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:10:09 +09:00
th-kim0823
0386adcb5e feat(kebab-cli): kebab ingest-stdin subcommand (fb-31)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:07:35 +09:00
th-kim0823
9cc7deca11 feat(kebab-cli): kebab ingest-file subcommand (fb-31)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:06:25 +09:00
th-kim0823
a42f907640 🧪 test(kebab-app): ingest_stdin_with_config integration (fb-31)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:04:52 +09:00
th-kim0823
67050016cc feat(kebab-app): ingest_stdin_with_config facade (fb-31)
Wraps body with YAML frontmatter (title + source_uri) via
crate::external::inject_frontmatter, writes to
_external/<hash12>.md, delegates to ingest_file_with_config. Markdown
only in v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:04:01 +09:00
th-kim0823
73ee64c73f 🧪 test(kebab-app): ingest_file_with_config integration (fb-31)
Three scenarios — copies external md + reports new=1, idempotent on
second call (unchanged=1), errors on missing path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:02:52 +09:00
th-kim0823
9b53dcb94f feat(kebab-app): ingest_file_with_config facade (fb-31)
Single-file ingest entry. Copies bytes to _external/<hash12>.<ext>
via crate::external::copy_to_external, runs the per-medium pipeline on
that single asset (reuses ingest_with_config_opts via a SourceScope
{ root: _external/, include: [<filename>], exclude:
config.workspace.exclude }).

`.kebabignore` matches log a stderr warn line and proceed (explicit
ingest is bypass intent). Internal helper `check_kebabignore_match`
uses the `ignore` crate's GitignoreBuilder.

Returns the standard IngestReport (incremental ingest from fb-23
handles re-ingest as `unchanged`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:01:18 +09:00
th-kim0823
41061a38ac 🏗️ feat(kebab-app): external module — _external dir + frontmatter inject (fb-31)
Pure-fn helpers for the `_external/` workspace subdirectory:
- `ensure_external_dir(workspace_root)` — mkdir if absent
- `ensure_kebabignore_entry(workspace_root)` — append `_external/` line
  to .kebabignore if missing (idempotent)
- `copy_to_external(ext_dir, bytes, ext)` — write to
  `<ext_dir>/<blake3-12>.<ext>`, idempotent on same content
- `inject_frontmatter(body, title, source_uri?)` — prepend YAML block
  with strict double-quote escaping; errors if body already starts
  with `---`
- `yaml_quote(s)` — defensive escaping for agent-supplied strings

14 unit tests cover happy + idempotency + edge (CRLF frontmatter
detection, YAML escape).

ingest_file / ingest_stdin facades (Tasks 2-5) compose these.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 17:58:31 +09:00
177ce21f88 Merge pull request 'docs(fb-31): single-file / stdin ingest spec + implementation plan' (#110) from spec/p9-fb-31-single-file-stdin-ingest into main
Reviewed-on: #110
2026-05-07 08:54:56 +00:00
th-kim0823
b7c85e8887 📝 docs(plan): p9-fb-31 single-file / stdin ingest implementation plan
12-task plan covering:
- kebab-app::external module (4 helpers + 12 unit tests) — Task 1
- kebab-app::ingest_file_with_config facade — Task 2
- kebab-app integration test — Task 3
- kebab-app::ingest_stdin_with_config facade — Task 4
- kebab-app integration test — Task 5
- kebab-cli Cmd::IngestFile + Cmd::IngestStdin arms — Tasks 6 + 7
- kebab-cli spawn-based integration tests — Task 8
- kebab-mcp ingest_file + ingest_stdin tools (4 → 6) — Task 9
- kebab-mcp integration tests — Task 10
- doc sync (README + HANDOFF + CLAUDE + skill + design §6.3) — Task 11
- HOTFIXES + status flip + final verification — Task 12

Implementation strategy: ingest_file_with_config copies bytes to
_external/<hash>.<ext> then delegates to existing
ingest_with_config_opts via SourceScope { root: _external/, include:
[<filename>], ... } — minimal change to existing walk pipeline.
ingest_stdin_with_config = frontmatter inject + ingest_file delegation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 17:49:04 +09:00
th-kim0823
7772fbc00f 📝 docs(spec): p9-fb-31 single-file / stdin ingest 설계 문서
신규 명령 `kebab ingest-file` + `kebab ingest-stdin` + MCP tool
`ingest_file` + `ingest_stdin` 도입 brainstorm 산출물. agent fetch 한
web markdown / 단일 외부 file 을 KB 에 즉시 저장.

핵심 결정:
- 외부 file 저장: copy in (`<workspace.root>/_external/<hash12>.<ext>`).
  blake3 content hash 기반 deterministic 명명 → idempotent.
- CLI: 신규 subcommand 2개 (기존 `kebab ingest` 무영향).
- MCP: 4 → 6 tool. fb-30 v1 read-only 정책 변경 — 첫 mutation tool
  surface (의도된 진화).
- .kebabignore: explicit ingest 가 default bypass + stderr warn.
- stdin v1: markdown 전용 + flag (--title, --source-uri) → frontmatter
  자동 prepend. 이미 frontmatter 있으면 error (use ingest-file).
- `_external/` 디렉토리 첫 생성 시 .kebabignore 자동 append (walk
  re-ingestion 무한 루프 방지).
- source_uri 는 frontmatter → Document.metadata 자동 흐름. wire
  schema 변경 없음 (ingest_report.v1 / search_hit.v1 의 metadata
  free-form map 재사용).

릴리스: 0.3.1 → 0.3.2 patch — additive only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 17:29:30 +09:00
138f31d661 Merge pull request 'chore: bump version 0.3.0 → 0.3.1 (fb-30 release)' (#109) from chore/bump-v0.3.1 into main
Reviewed-on: #109
2026-05-07 08:07:47 +00:00
th-kim0823
5495d96275 chore: bump workspace version 0.3.0 → 0.3.1 — fb-30 release
agent integration MVP (fb-30 MCP server stdio) 머지 후 patch bump.

trigger:
- 신규 CLI surface `kebab mcp` (additive, 기존 명령 동작 무영향)
- new crate `kebab-mcp` (lib only, transitive 만)
- capability flag `mcp_server: false → true` (additive on schema.v1)
- design §10.2 MCP transport 절 추가 (additive subsection)

CLAUDE.md release 규약상 wire schema additive / surface 추가는
minor bump 영역이지만, pre-1.0 (0.x.y) 단계에서는 fb-26~31 group
누적 시까지 0.3.x patch 로 묶고, 큰 jump 시 0.4.0 cut 으로 운용.
fb-30 단독으로 break 없는 additive 라 0.3.1 patch 적정.

후속 fb-26 / 28 / 31 머지 시점에 추가 0.3.x patch 누적, breaking
schema 변경 시 0.4.0 minor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 17:06:45 +09:00
eb4f594dda Merge pull request 'feat(fb-30): MCP server (stdio) — agent integration MVP' (#108) from feat/p9-fb-30-mcp-server into main
Reviewed-on: #108
2026-05-07 08:04:53 +00:00
th-kim0823
4e2090e54d 🏗️ refactor(fb-30): apply round 1 review nits
- error_wire.rs: extract `pub const ERROR_V1_ID = "error.v1"` + replace
  9 inline literals (parallel to schema.rs::SCHEMA_V1_ID pattern).
  Re-export via kebab-app::lib.rs.
- kebab-mcp/src/lib.rs: extract `KebabHandler::spawn_tool<I, F>` helper —
  search + ask arms reduce from ~17 lines each to a one-line dispatch.
  Future tool 추가 시 boilerplate 안 늘림.
- ask.rs: defensive `to_value(&answer)` — silent Null 위험 제거, 실패
  시 to_tool_error fallthrough.
- HOTFIXES: note AskOpts Default 미도입 limitation.
- ARCHITECTURE.md: directory tree 의 kebab-mcp 항목에 `schema` 추가
  (4 tool 모두 명시).

Round 1 review summary: #108 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:59:40 +09:00
th-kim0823
2387c6cd11 🚑 docs + cleanup: ARCHITECTURE.md kebab-mcp + state.rs stale comment + schema test cap-flag (fb-30)
Final review fixes:

- docs/ARCHITECTURE.md: add kebab-mcp to UI subgraph + directory tree
  (CLAUDE.md "add new crate" rule required this; missed in Task 12 doc sync).
- state.rs: replace forward-reference Task 10 comment with current-state
  doc (config_path now wired by Task 10 commit 4a30959).
- tools_call_schema.rs: assert capabilities.mcp_server == true (already
  pinned in schema_report + cli_schema, this closes the gap in mcp's own
  test).

Version bump 0.3.0 → 0.4.0 deferred to separate `chore/bump-v0.4.0` PR
mirroring fb-27 precedent (commit 73f5d73 / PR #105).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:34:46 +09:00
th-kim0823
ee4f198308 📝 docs(tasks): HOTFIXES entry + p9-fb-30 status → completed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:17:37 +09:00
th-kim0823
f758d51a01 📝 docs: sync README / HANDOFF / CLAUDE / skill / design for fb-30
- README 명령 표 에 `kebab mcp` 추가 + Claude Code MCP config 예시
- HANDOFF post-도그푸딩 항목 한 줄 (rmcp 1.6 + manual dispatch + error_wire promotion + ask/search spawn_blocking + capability flag flip 명시)
- CLAUDE.md facade 룰 의 UI crate 카테고리 에 `kebab-mcp` 추가
- integrations skill — MCP 사용 안내 (recommended over subprocess)
- design §10.2 MCP transport 절 신설

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:15:19 +09:00
th-kim0823
366b647a1a feat(kebab-app): capability flag mcp_server: false → true (fb-30)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:12:23 +09:00
th-kim0823
4a30959fdd feat(kebab-cli): kebab mcp subcommand (fb-30)
Wires kebab_mcp::serve_stdio into kebab-cli. `--config <path>` honored
via the established Config::load pattern.

Updated serve_stdio signature to (Config, Option<PathBuf>) so the doctor
tool's path-aware behavior works correctly via KebabAppState.

Smoke test spawns the binary + sends initialize + initialized +
tools/list over stdin, asserts 4 tools returned. Confirms the MCP
server boots end-to-end via the real binary (rmcp 1.6 has no
in-memory test transport, so this is the only end-to-end assertion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:10:17 +09:00
th-kim0823
61eef9bc82 🧪 test(kebab-mcp): tools/list returns 4 tools (fb-30)
Approach: extracted `pub fn build_tools_vec() -> Vec<Tool>` from the
inline `list_tools` trait impl body — `RequestContext<RoleServer>` is
non-constructible from outside rmcp (Peer::new is pub(crate)), so a
direct trait-method call was not viable without an in-memory transport
rmcp 1.6 does not expose. The helper is the single source of truth;
`list_tools` now delegates to it.

Three test cases in tests/tools_list.rs:
- 4 tools present with correct names
- search inputSchema has "required": ["query"]
- schema/doctor tools accept empty input (type=object, no required)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:04:42 +09:00
th-kim0823
bc16dbf12a 🚑 fix(kebab-cli): add schema_version field to wire.rs ErrorV1 test literal
Task 8 commit f9a1548 added `schema_version: String` as required field on
ErrorV1 (so kebab-mcp's direct serialize-then-emit path produces correct
error.v1 wire). The wire.rs ErrorV1 literal in the
error_wrapper_tags_schema_version_and_emits_code test was missed —
breaks kebab-cli build. Add the field to the test fixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:02:09 +09:00
th-kim0823
f9a1548b53 🧪 test(kebab-mcp): error mapping — bad config → error.v1 (fb-30)
Adds integration test schema_tool_emits_error_v1_when_db_missing that
verifies NotIndexed errors are emitted as error.v1 JSON with isError=true.
Also fixes ErrorV1 struct to include required schema_version field per
error.v1 wire contract (docs/wire-schema/v1/error.schema.json).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:58:52 +09:00
th-kim0823
c8e04c65e0 🏗️ refactor(kebab-mcp): fix ask tool production panic + mode default (fb-30)
Two issues from Task 7 review:

1. CRITICAL — call_tool "ask" arm called blocking ask handle from async
   context. OllamaLanguageModel::new builds reqwest::blocking::Client
   which creates+drops a tokio runtime → panic inside async. Fix:
   tokio::task::spawn_blocking wrap. Also applied preemptively to
   "search" arm (SqliteStore + Lance open are blocking IO too).

2. IMPORTANT — ask tool's retrieval mode hardcoded to Lexical (test
   workaround for provider="none"); CLI default is Hybrid. Fix: add
   `mode: Option<String>` field to AskInput, default Hybrid in handle,
   test passes mode=Some("lexical") explicitly to keep test functional
   on provider="none".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:55:02 +09:00
th-kim0823
4b1b8a15bf feat(kebab-mcp): ask tool (fb-30)
Fourth (and final v1) tool — `ask` (input: query / optional session_id).
Multi-turn via optional session_id (kebab_app::ask_with_session_with_config),
single-shot via ask_with_config when None. Refusal (grounded:false) NOT
mapped to isError — agent branches on the wire payload's grounded flag.

AskOpts has no Default impl (must construct manually). Answer carries no
schema_version field (tagged inline via entry().or_insert_with, idempotent).
Mode defaulted to Lexical: reqwest::blocking::Client::build creates and
drops a tokio runtime, panicking inside async context — the empty-corpus
refusal test avoids this via spawn_blocking; the tool itself uses Lexical
as the default mode since MCP callers typically run without an embedding
provider configured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:50:57 +09:00
th-kim0823
52782fdf72 feat(kebab-mcp): search tool (fb-30)
Third tool — `search` (input: query / mode / k). First tool with
non-empty input — establishes the pattern: SearchInput struct with
JsonSchema derive + Tool::new uses
rmcp::handler::server::common::schema_for_type::<SearchInput>() for
inputSchema + call_tool match arm parses request.arguments via
serde_json::from_value.

search_with_config takes owned Config, so state.config (Arc<Config>)
is cloned via (*state.config).clone(). Output: search_hit.v1 array —
SearchHit (kebab-core) does not carry schema_version field, so each
element is tagged inline before serialising.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:44:56 +09:00
th-kim0823
360fa53b02 feat(kebab-mcp): doctor tool (fb-30)
Second tool — `doctor` (no input args, returns doctor.v1 JSON via
kebab_app::doctor_with_config_path). Mirrors schema tool's manual-dispatch
pattern: Tool::new entry in list_tools, match arm in call_tool, per-tool
module in tools/doctor.rs.

doctor_with_config_path takes Option<&Path> (not &Config), so KebabAppState
is extended with config_path: Option<PathBuf>. All existing callers
(initialize.rs, tools_call_schema.rs, serve_stdio_async) pass None for now;
Plan Task 10 (Cmd::Mcp wiring) will thread the actual --config path through.
doctor_with_config falls back to XDG default when config_path is None —
same behavior as bare `kebab doctor`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:41:13 +09:00
th-kim0823
8ca8e18d12 feat(kebab-mcp): schema tool (fb-30)
First tool wired — `schema` (no input args, returns schema.v1 JSON
mirroring `kebab schema --json`). Establishes the per-tool module
pattern (crates/kebab-mcp/src/tools/<name>.rs) + error helper that maps
anyhow::Error to MCP CallToolResult.error with error.v1 content.

Dispatch pattern: manual dispatch — explicit `list_tools` + `call_tool`
overrides on `impl ServerHandler for KebabHandler` with a
`match request.name.as_ref()` arm per tool. No proc-macro magic.
Tasks 5-7 should add a new arm + new tools/<name>.rs following the same
pattern; also add a `Tool::new(...)` entry in `list_tools`.

API shapes confirmed from rmcp 1.6 source:
- Content = Annotated<RawContent>; text via `Content::text(s)`; pattern
  match via `&content.raw` → `RawContent::Text(t)` → `t.text`
- CallToolResult::success(Vec<Content>) / ::error(Vec<Content>)
- ListToolsResult::with_all_items(Vec<Tool>)
- schema_for_empty_input() from rmcp::handler::server::common

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:38:00 +09:00
th-kim0823
8f6e6bc01a feat(kebab-mcp): handler skeleton + initialize handshake (fb-30)
KebabHandler implements rmcp::ServerHandler::get_info — returns
serverInfo (name="kebab", version from CARGO_PKG_VERSION) and
capabilities.tools. KebabAppState wraps Config in Arc for cheap clone
into per-request task scope. serve_stdio entry builds a multi-thread
tokio runtime and runs the server until client closes the stream.

rmcp 1.6 API used:
- rmcp::ServerHandler trait (re-exported from handler::server)
- ServerInfo::new(caps).with_server_info(impl) builder (not struct-init:
  InitializeResult/Implementation are #[non_exhaustive])
- ServerCapabilities::builder().enable_tools().build() — builder macro
  generated, confirms the plan-literal pattern works
- Implementation::new(name, version) — non-exhaustive constructor
- rmcp::transport::stdio() returns (tokio::io::Stdin, tokio::io::Stdout)
  tuple; tuple impls IntoTransport via AsyncRead+AsyncWrite blanket
- handler.serve(transport).await → RunningService<RoleServer, H>
  (ServiceExt::serve, returns Result<_, ServerInitializeError>)
- service.waiting().await → Result<QuitReason, JoinError>
- serve_stdio is plain fn wrapping a manually-built tokio runtime
  (avoids nested-runtime hazard if kebab-cli ever gains its own rt)

Tools wire-up lands in subsequent tasks (one tool per task).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:31:13 +09:00
th-kim0823
2c09ed6af4 🏗️ chore(kebab-mcp): scaffold new crate (fb-30)
Empty lib + serve_stdio entry that bails until Task 3 wires rmcp. Adds
rmcp 1.6 to workspace dependencies (server + macros + transport-io +
schemars features) + tokio multi-thread/io-util/io-std local extensions.

schemars declared as "1" (resolved to 1.2.1) — matches rmcp 1.6's ^1.0
requirement (verified via crates.io /dependencies; plan literal was 0.9
which would conflict). Path-style refs for kebab-app / kebab-config /
kebab-core follow workspace convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:25:57 +09:00
th-kim0823
1f53930234 🏗️ refactor(kebab-app): promote error_classify → kebab-app::error_wire (fb-30 prep)
fb-30 의 새 crate `kebab-mcp` 가 동일 classify 모듈 사용 — UI crate 끼리
import 는 facade rule 위반이므로 kebab-app 으로 promotion. fb-27 commit
c91228e 의 코드 그대로 이전 (struct + classify + classify_llm + 7 unit
test). reqwest dev-dep 도 함께 이동.

kebab-cli 는 `kebab_app::ErrorV1` / `kebab_app::classify` 로 import 경로
1줄 변경 + wire.rs 의 `&crate::error_classify::ErrorV1` 1줄 교체. 동작
무영향.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:13:28 +09:00
65cfaf7c75 Merge pull request 'docs(fb-30): MCP server (stdio) spec + implementation plan' (#107) from spec/p9-fb-30-mcp-server into main
Reviewed-on: #107
2026-05-07 06:07:41 +00:00
th-kim0823
72855df98b 📝 docs(plan): p9-fb-30 MCP server implementation plan
14-task plan covering:
- error_classify → kebab-app::error_wire promotion (Task 1)
- new crate kebab-mcp scaffold (Task 2)
- KebabHandler skeleton + initialize (Task 3)
- 4 tool wire-up: schema / doctor / search / ask (Tasks 4-7)
- error mapping test (Task 8)
- tools/list integration (Task 9)
- kebab-cli Cmd::Mcp + spawn smoke (Task 10)
- capability flag flip (Task 11)
- doc sync (Task 12)
- HOTFIXES + status flip (Task 13)
- final workspace verification (Task 14)

rmcp 1.6 SDK 채택. plan 의 macro / extractor 시그니처는 best-effort —
실제 rmcp 1.6 API 와 다르면 Task 3 부터 hand-roll fallback 명시.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:54:43 +09:00
th-kim0823
58ec4578b9 📝 docs(spec): p9-fb-30 MCP server (stdio) 설계 문서
`kebab mcp` 신규 subcommand + new crate `kebab-mcp` 도입을 위한
brainstorm 산출물. agent integration "MVP" 완성 (Claude Code / Cursor /
OpenAI Agents 등 host-agnostic 사용 가능).

핵심 결정:
- `kebab mcp` subcommand (kebab-cli 내, 신규 binary 아님)
- 4 read-only tool (`search` / `ask` / `schema` / `doctor`) — ingest /
  fetch / list_docs / inspect_chunk 는 fb-31 / fb-35 / 후속에서 추가
- Resources / Prompts 모두 skip (tools only)
- Rust MCP SDK 사용 — rmcp 채택 우선, plan 단계 verify
- stdio 단일 transport — fb-29 deferral 따라 HTTP-SSE P+
- error mapping: tool dispatch 실패만 isError=true + error.v1 content,
  refusal / no-hit / unhealthy 는 정상 응답 (semantic flag 으로 분기)
- classify 모듈 이전: kebab-cli::error_classify → kebab-app::error_wire
  (kebab-cli + kebab-mcp 둘 다 동일 모듈 사용, facade 룰 준수)
- capability flag `mcp_server` false → true

릴리스: 0.3.0 → 0.4.0 minor — 신규 surface + new crate + 디자인 §10.1
변경 + capability flip 모두 trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:44:09 +09:00
769139996b Merge pull request 'docs: defer fb-29 HTTP daemon + fb-30 stdio-only' (#106) from chore/fb-29-defer into main
Reviewed-on: #106
2026-05-07 04:59:36 +00:00
th-kim0823
2e8de14434 📝 docs: defer fb-29 HTTP daemon + fb-30 stdio-only
2026-05-07 brainstorm 결정 — fb-29 HTTP daemon defer.

근거:
- single-user local-first 환경에서 daemon 복잡도 (PID file / port lock /
  single-instance / lifecycle UX / loopback security) 가 비대.
- fb-30 stdio MCP 가 동일 사용자 가치 (agent integration + session 동안
  hot cache) 를 daemon 없이 제공 — agent host (Claude Code, Cursor 등)
  가 subprocess 띄우면 session 동안 process 유지 = 자연스러운 hot state.
- ask 의 dominant cost 는 Ollama LLM 추론 (수 초) 라 daemon 효과 제한적.
  search 만 의미 있는데 그 비용도 fastembed model load (~1s) 가 dominant —
  stdio MCP subprocess 가 동일하게 회피.

후속 fb-30 의 transport 는 stdio 단일. HTTP-SSE 옵션은 future task —
재개 trigger:
- browser agent / remote multi-host 시나리오 등장.
- TUI ↔ CLI 다중 인스턴스 state sharing 요구.
- fb-30 MCP HTTP-SSE 변형 도입 검토.

영향:
- fb-29 spec frontmatter `status: open` → `deferred`,
  `target_version: 0.3.0` → `P+`, `unblocks: [p9-fb-30]` → `[]`.
- fb-30 `depends_on: [p9-fb-27, p9-fb-29]` → `[p9-fb-27]`.
- INDEX.md 의 0.3.0 group + HANDOFF 의 release 그룹 표 갱신.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 13:58:02 +09:00
50ad7f1f28 Merge pull request 'chore: bump version 0.2.1 → 0.3.0 (fb-27 release)' (#105) from chore/bump-v0.3.0 into main
Reviewed-on: #105
2026-05-07 04:48:53 +00:00
th-kim0823
73f5d73112 chore: bump workspace version 0.2.1 → 0.3.0 — fb-27 release
agent foundation MVP 첫 component (fb-27 introspection + structured
error wire) 머지 후 minor bump. trigger:

- frozen design contract 변경 (§10.1 capability matrix subsection 신설)
- wire schema additive (`schema.v1` + `error.v1` 신규)
- 신규 CLI surface (`kebab schema`)

이전 binary (v0.2.1) 는 새 wire schema / error wire 인지 못 함 — 0.3.0
부터 agent integration (fb-30 MCP, fb-29 daemon, fb-28 readonly/quiet 등)
의 capability discovery + structured error 활용 가능.

후속 fb-26 / 28 / 29 / 30 / 31 머지는 0.3.x patch / 0.4.0 누적 예정.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 13:38:05 +09:00
c732189eb3 Merge pull request 'feat(fb-27): introspection (kebab schema) + structured error wire' (#104) from feat/p9-fb-27-introspection-and-error-wire into main
Reviewed-on: #104
2026-05-07 04:19:40 +00:00
th-kim0823
1bcca7f9ca 🏗️ refactor(fb-27): apply round 1 review nits
- schema.rs: extract `SCHEMA_V1_ID` const + re-export via kebab-app::lib.rs.
  wire.rs::wire_schema 의 2 literal 도 import 해서 single source of truth.
- schema.rs::collect_models: parser_version 가 markdown 만 surface 함을
  주석으로 명시 (PDF/image extractor 의 자체 version 은 SchemaV1.models 가
  multi-medium map 으로 진화 시 surface).
- main.rs::print_schema_text: 헤더 줄 끝의 `\n` 제거 + `println!()` 추가 —
  다른 section 들과 패턴 일관.
- error_classify.rs::llm_unreachable_classifies: timeout 50ms → 500ms (10x
  headroom) + 접근 방식 + 한계 주석 추가.
- HOTFIXES: open_existing 의 RW flag + 주석-only enforcement 갭을
  Known-limitation 에 명시.

Round 1 review summary: #104 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 13:17:50 +09:00
th-kim0823
f7d6ea593e 📝 docs: fix capability flag names in skill + document not_indexed details deviation (fb-27)
- SKILL.md: `streaming_ingest` → `streaming_ask`, `multi_turn` → `rag_multi_turn` (capability name mismatch flagged in final review — agents following the example literal would read non-existent fields).
- HOTFIXES.md: add `not_indexed.details` to the interim wire shape deviations list — emit `{ expected, found }` only (spec literal `{ data_dir, expected, found }` not honored because NotIndexed signal carries one full path, not separate data_dir).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:58:17 +09:00
th-kim0823
cc0e4e6551 📝 docs(tasks): HOTFIXES entry + p9-fb-27 status → completed
HOTFIXES 항목이 fb-27 의 live binding 변경 + interim wire shape
deviation 의 source of truth (error.v1.details 가 신규 typed signal
도입 전까지 spec literal 과 일부 일탈).

spec 상단 banner 와 frontmatter status 가 frozen 상태 + post-merge
HOTFIXES cross-link 으로 갱신.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:40:38 +09:00
th-kim0823
bb7b1cec4b 📝 docs: sync README / HANDOFF / CLAUDE / skill / design for fb-27
- README 명령 표 에 `kebab schema` 추가
- HANDOFF post-도그푸딩 항목 한 줄
- CLAUDE.md wire schema 절 schema.v1 / error.v1 추가
- integrations skill — schema 활용 안내 (additive)
- design §10.1 capability matrix subsection 신설

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:37:40 +09:00
th-kim0823
c25f4f89e3 📝 docs(wire-schema): schema.v1 + error.v1 JSON Schema (fb-27)
schema.v1: full introspection report shape with required fields for
wire / capabilities / models / stats. capabilities object enumerates
all 10 flag names (current 6 true + future 4 false) as required keys.

error.v1: 7-code enum + permissive details object. Real emitted
details shapes documented in description (per-code context varies and
some fields are interim until IoFailure / OpTimeout typed signals
land in follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:35:15 +09:00
th-kim0823
3725986af7 🧪 test(kebab-cli): integration coverage for kebab schema + error.v1 (fb-27)
cli_schema: exercises `kebab schema` (text + --json) on a fresh-but-init'd
KB. Pins schema_version, kebab_version non-empty, capabilities.json_mode
true, capabilities.mcp_server false (future placeholder).

cli_error_wire: spawns `kebab --json --config <malformed.toml> ingest`
and verifies stderr emits a single error.v1 ndjson line with
code == "config_invalid". Non-JSON mode regression-pinned to keep the
legacy `error:` prefix. Note: --config /nonexistent silently falls back
to defaults (by design); a file that exists but fails TOML parsing is
the reliable trigger for config_invalid.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:33:46 +09:00
th-kim0823
912c7aa07d feat(kebab-cli): emit error.v1 ndjson on stderr in --json mode (fb-27)
Wraps the existing `Err(e)` arm with a `cli.json` branch:
- `--json`: stderr ndjson `error.v1` via wire_error_v1
- non-`--json`: legacy `error: <msg>` text path (unchanged)

exit_code() unchanged — RefusalSignal/NoHitSignal/DoctorUnhealthy
still drive 1/1/3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:28:41 +09:00
th-kim0823
4eb13c63ae feat(kebab-cli): kebab schema subcommand (fb-27)
Text mode: doctor-style key/value layout. JSON mode: schema.v1 wire
record. Honors `--config <path>` via the established
`kebab_app::schema_with_config(&cfg)` facade pattern (per the P3-5 /
P4-3 regression conventions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:24:06 +09:00
th-kim0823
c91228e7d5 feat(kebab-cli): error_classify dispatcher + wire helpers (fb-27)
`error_classify::classify` maps anyhow::Error → ErrorV1 wire record by
downcasting to known typed errors (LlmError + ConfigInvalid + NotIndexed
re-exported from kebab_app::error_signal, plus std::io::Error chain).
Generic fallback emits `code: "generic"` with the chain in `details` when
verbose.

wire.rs adds wire_schema (idempotent re-tag, mirrors wire_doctor pattern
since SchemaV1 carries its own schema_version field) and wire_error_v1
(simple tag_object). Tests pin both wrappers + 7 classify code paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:11:54 +09:00
th-kim0823
3e33daaa9b 🧪 test(kebab-app): assert chunk_count + asset_count in schema_report (fb-27)
Plug coverage hole flagged in code review — test 1 was asserting only
doc_count + last_ingest_at, leaving count_summary's chunk_count and
asset_count queries un-pinned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:06:04 +09:00
th-kim0823
ab96335174 🧪 test(kebab-app): schema_with_config integration coverage (fb-27)
Two scenarios: freshly-ingested 2-doc KB (stats reflect counts +
last_ingest_at populated) and empty-but-initialized KB (counts zero,
last_ingest_at None). The empty case runs ingest_with_config over an
empty workspace dir to seed kebab.sqlite before calling schema_with_config,
since open_existing (used internally) returns NotIndexed if the DB is absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:02:20 +09:00
th-kim0823
61aae1c1d5 🏗️ refactor(kebab-app): consolidate PARSER_VERSION + clarify intent (fb-27)
Replace kebab-app's private `KEBAB_PARSE_MD_VERSION` literal with a
direct reference to `kebab_parse_md::PARSER_VERSION` so the parser
version cascade has a single source of truth (design §9 invariant).

Add maintenance comment on schema.rs WIRE_SCHEMAS const pointing to
docs/wire-schema/v1/ + kebab-cli wire helpers as the authoritative
sources to keep in sync.

Tighten open_existing doc comment to match the actual SQLITE_OPEN_READ_WRITE
flag (needed for WAL pragma application) — callers should still avoid
issuing mutations through this connection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 11:58:06 +09:00
th-kim0823
39b4433549 feat(kebab-app): schema_with_config facade (fb-27)
New `SchemaV1` struct + `schema_with_config(&Config)` builder. Surfaces
wire schemas list, capabilities (current + future placeholders), model
versions (parser/chunker/embedding/prompt_template/index/corpus_revision),
and stats (doc/chunk/asset counts + last ingest). kebab-store-sqlite
gains `count_summary()` to back the stats block.

Deviations from plan:
- `cfg.models.embedding.id` → `cfg.models.embedding.model` (actual field name)
- No `Config::expand_path` method → free fn `kebab_config::expand_path(&cfg.storage.data_dir, "")`
- `PARSER_VERSION` added to `kebab-parse-md/src/lib.rs` (was absent; synced with `KEBAB_PARSE_MD_VERSION` literal in kebab-app)
- `INDEX_VERSION_STR` added to `kebab-store-vector/src/store.rs` + re-exported from `lib.rs` (was a private `const`)
- `corpus_revision()` returns `u64` directly (not `Result<u64>`) — no `?` in collect_models
- `SchemaV1` carries `schema_version: "schema.v1"` field (wire schema convention)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 11:46:37 +09:00
th-kim0823
1c4d554bf4 🏗️ refactor(kebab-store-sqlite): harden open_existing against silent create (fb-27)
Replace `path.exists()` + `Connection::open` (which silently CREATEs on
race) with `Connection::open_with_flags` using READ_WRITE|URI but NOT
CREATE. SQLite surfaces `SQLITE_CANTOPEN` for missing files; we wrap as
NotIndexed { found: None } as before.

Adds open_existing_does_not_create_missing_db regression test pinning
the no-side-effect invariant.

Also documents read-only intent on open_existing, the format contract
on NotIndexed.found, and removes scaffolding comments from kebab-app
error_signal that are no longer load-bearing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 11:40:42 +09:00
th-kim0823
d7bfd01ef5 feat(kebab-store-sqlite): add NotIndexed typed error (fb-27)
New `SqliteStore::open_existing` API + `NotIndexed` signal for the
missing-DB case. kebab-app re-exports the type via its `error_signal`
module so kebab-cli's `error_classify` can map it to
`error.v1 { code: "not_indexed" }`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 11:32:04 +09:00
th-kim0823
26a2e021b0 🏗️ refactor(kebab-config): stabilize ConfigInvalid.cause prefix (fb-27)
Replace `read failed: {e}` / `parse failed: {e}` with the underscore-
slugged `read_failed:` / `parse_failed:` prefixes so kebab-cli's
error_classify (Task 8) and the error.v1 JSON Schema (Task 14) can
treat the prefix as a stable wire contract while leaving the OS /
toml-crate detail in the suffix as free-form context.

Also add the symmetric `cause` non-empty assertion to the malformed-TOML
test so a regression that empties `cause` on the parse path would be
caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 11:29:00 +09:00
th-kim0823
58c06664b8 feat(kebab-config): add ConfigInvalid typed error (fb-27)
Wraps every error path in `Config::from_file` (read failure, TOML parse,
validation) so downstream callers can `downcast_ref::<ConfigInvalid>()`
to build the `error.v1` wire record. kebab-app re-exports the type via
its `error_signal` module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 11:22:58 +09:00
th-kim0823
3efdf7ef2f 🏗️ chore(kebab-app): scaffold error_signal module (fb-27)
Re-exports existing doctor_signal entries (RefusalSignal / NoHitSignal /
DoctorUnhealthy) + LlmError from kebab-llm-local. ConfigInvalid /
NotIndexed re-exports added in subsequent tasks once the source crates
define them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 11:17:58 +09:00
th-kim0823
c20da0f274 📝 docs(plan): p9-fb-27 implementation plan
17-task TDD plan covering: typed signal scaffolding (kebab-app
error_signal module), ConfigInvalid + NotIndexed typed errors, SchemaV1
struct + schema_with_config facade, count_summary helper, wire_schema +
wire_error_v1 helpers, error_classify dispatcher, Cmd::Schema CLI arm,
--json mode error.v1 emission, JSON Schema literals, doc sync.

Each task = bite-sized TDD cycle (write failing test → impl → verify
pass → commit). Final task = workspace clippy + cargo test --workspace
-j 1 + manual smoke.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 11:12:05 +09:00
th-kim0823
f01f8dfc9b 📝 docs(spec): p9-fb-27 introspection + error wire 설계 문서
`kebab schema` 신규 명령 + `error.v1` wire schema 도입을 위한 brainstorm
산출물. agent 통합 (fb-30 MCP, fb-29 daemon) 의 prerequisite — 한 번의
introspection 호출로 wire 버전 / capabilities / model versions / index
stats 를 노출하고, fatal error 가 `--json` 모드에서 stderr ndjson 으로
구조화된다.

핵심 결정:
- `kebab schema --json` 단일 명령 (정적 + 동적 통합)
- error.v1 emission 은 `--json` 모드에서만 — 비 `--json` 은 기존 stderr text 유지
- exit code 0/1/2/3 unchanged, error.v1.code 가 fine-grained 분기
- 7 code initial set (config_invalid / not_indexed / model_unreachable /
  model_not_pulled / timeout / io_error / generic) + future-additive 정책

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 10:58:24 +09:00
16b4f9fb9f 📝 docs(HANDOFF): 도그푸딩 피드백에 따른 백로그 항목 추가
- P9 dogfooding 백로그 항목 fb-26 ~ fb-42 추가
- 각 항목의 목표, 증상, 후속 작업 및 위험 요소 명시
- release 계획에 따른 0.3.0 ~ 0.6.0 분할

📝 docs(INDEX): 백로그 항목에 대한 세부 정보 추가

- fb-26 ~ fb-42 항목의 세부 정보 및 상태 추가
- 각 항목의 목표와 후속 작업 명시
- 도그푸딩 피드백에 따른 개선 사항 반영

🔧 chore(tasks): 새로운 백로그 항목 파일 생성

- p9-fb-26 ~ p9-fb-42 각 항목에 대한 개별 파일 생성
- 각 파일에 목표, 증상, 후속 작업 및 위험 요소 포함
- doogfooding 피드백을 기반으로 한 개선 사항 문서화
2026-05-06 13:26:36 +00:00
dd172af47c Merge pull request 'feat(integrations): ship Claude Code skill for kebab' (#103) from feat/claude-code-skill into main
Reviewed-on: #103
2026-05-06 09:04:31 +00:00
th-kim0823
e31871d03e feat(integrations): ship Claude Code skill for kebab
Add `integrations/claude-code/kebab/` — a Claude Code skill that calls
`kebab search --json` / `kebab ask --json` automatically when the user
asks an internal-context / runbook / wiki-lookup question. Install via
`cp -r integrations/claude-code/kebab ~/.claude/skills/` (or symlink for
in-place updates).

The shipped SKILL.md keeps its frontmatter `description` generic so the
trigger fires on any "internal / org-specific" cue — per-user keywords
(team / system / acronym) belong in a local override, not in the
repo-shipped frontmatter.

CLAUDE.md §Wire schema v1: add a paragraph documenting the
`integrations/<host>/` layout and the rule that a wire schema major bump
(v1→v2) updates each shipped integration in the same PR — same shape as
the version-cascade rule.

README.md §외부 AI 통합: replace the "~50줄 wrapper" prose with a direct
link to `integrations/claude-code/`, since the wrapper is now in-tree.
2026-05-06 18:00:53 +09:00
5398fb057c Merge pull request 'chore: bump workspace version 0.2.0 → 0.2.1 (p9-fb-25 release)' (#102) from chore/bump-v0.2.1 into main
Reviewed-on: #102
2026-05-05 12:51:33 +00:00
1409eaae51 chore: bump workspace version 0.2.0 → 0.2.1 — p9-fb-25 release
사용자 도그푸딩 / 실사용 binary 필요로 release bump (CLAUDE.md release
규약 트리거 1).

p9-fb-25 변경사항:

- WorkspaceCfg.include 제거 (옛 config 의 include = [...] silently
  무시 + 단발 deprecation warning).
- IngestItem.warnings 가 Skipped 시 사유 채움 (`unsupported media
  type: .docx` 등).
- IngestReport.skipped_by_extension: BTreeMap<String, u32> 신규
  (additive wire — v1 호환).
- CLI / TUI summary 에 breakdown (`5 skipped: 3 docx, 1 txt`).
- kebab init 헤더 + README 에 지원 형식 (md / png / jpg / pdf) 명시.

caller breaking 없음 (모든 surface 변경 backward-compat). 730
워크스페이스 테스트 통과.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:49:10 +00:00
8dadac2a45 Merge pull request 'fix(kebab-config): p9-fb-25 — workspace.include 제거 + 지원 형식 가시성' (#101) from fix/p9-fb-25-config-include-removal into main
Reviewed-on: #101
2026-05-05 12:46:34 +00:00
e4432a2388 review(p9-fb-25): 회차 1 nit 반영 — render_skipped_breakdown 단일 source + NO_EXT_SENTINEL + 카운트 + deprecation 문구
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:35:10 +00:00
51feff5f16 docs(p9-fb-25): README + HANDOFF + HOTFIXES + INDEX + per-task spec
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:20:38 +00:00
44dee2c30f feat(kebab-cli, kebab-tui): p9-fb-25 task 6 — render skipped-by-extension breakdown
Append ": A docx, B txt, ..." after the N skipped count in both the
CLI ingest summary and TUI status_line terminal events (completed +
aborted). Breakdown is desc-sorted by count, ties broken by key
alphabetic; empty map produces no extra text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:16:43 +00:00
9545367904 feat(kebab-app): p9-fb-25 task 5 — Skipped warnings + skipped_by_extension aggregation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:13:13 +00:00
693f5582f0 feat(kebab-core, kebab-app): p9-fb-25 task 4 — IngestReport.skipped_by_extension + wire schema additive
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:06:34 +00:00
d64282433c feat(kebab-app): p9-fb-25 task 3 — init_workspace header lists supported extensions
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:55:38 +00:00
ef5d0770ae review(p9-fb-25-task1): fix kebab-app test references to removed WorkspaceCfg.include
reviewer-flagged: task 1 missed test files using cfg.workspace.include.

- crates/kebab-app/tests/common/mod.rs: SourceScope literal switched
  to ..Default::default().
- crates/kebab-app/tests/image_pipeline.rs (×3): drop dead-no-op
  cfg.workspace.include.push(...) calls; comment explains removal.
- crates/kebab-app/tests/pdf_pipeline.rs: same treatment.

Pre-fb-25 these pushes were no-ops (include was dead config field
not enforced anywhere). Removal is purely mechanical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:53:19 +00:00
7f31721a47 refactor(kebab-cli, kebab-tui): p9-fb-25 task 2 — SourceScope via ..Default::default()
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:47:39 +00:00
b22c8cfd45 feat(kebab-config): p9-fb-25 task 1 — drop WorkspaceCfg.include + deprecation probe
- Remove `pub include: Vec<String>` from `WorkspaceCfg` struct (denylist-only model).
- Drop `include: vec!["**/*.md"]` from `Config::defaults()`.
- Add `from_file` deprecation probe: raw `toml::Value` scan fires a
  one-shot `tracing::warn!` (via `OnceLock`) when an old config still
  carries `workspace.include = [...]`. serde ignores the unknown field
  cleanly (no `deny_unknown_fields`).
- Compile-fix `kebab-cli` (main.rs:329) and `kebab-tui`
  (ingest_progress.rs:39): replace `cfg.workspace.include.clone()` with
  `Vec::new()` (Task 2 will switch to `..Default::default()`).
- Two new tests: `legacy_include_field_is_ignored_silently` (backward
  compat round-trip) + `workspace_cfg_has_only_root_and_exclude_fields`
  (exhaustive destructure — compile-time guard against re-introduction).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:44:35 +00:00
7c8f1f2637 plan(p9-fb-25): TDD implementation plan — 7 tasks
Spec → 7-step plan, TDD per task.

Tasks:
1. Drop WorkspaceCfg.include + deprecation probe in Config::from_file
   (legacy include silently ignored + 단발 tracing::warn).
2. SourceScope construction cleanup (CLI + TUI 사용 ..Default::default()).
3. init_workspace header 에 지원 형식 명시 (md / png / jpg / pdf) + 테스트.
4. IngestReport.skipped_by_extension + AggregateCounts.skipped_by_extension
   (BTreeMap, stable JSON key) + wire schema additive.
5. IngestItem.warnings 채움 (`unsupported media type: .docx` 등) +
   ext_for_skip_warning helper + asset 루프 bump.
6. CLI summary + TUI status_line render breakdown (desc 정렬, 모두).
7. Docs sync — README + HANDOFF + HOTFIXES + INDEX + per-task spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:28:11 +00:00
40ca4bf27e spec(p9-fb-25): config workspace.include 제거 + 지원 형식 가시성
도그푸딩 피드백 2026-05-05:
1. config 의 include + exclude 동시 존재가 case 4 (둘 다 매치 안 함)
   에서 의미 모호.
2. 어차피 처리 가능 형식 (md / png / jpg / pdf) 이 정해져 있으니
   사용자에게 명시 필요.

설계 핵심:

- `WorkspaceCfg.include: Vec<String>` 제거 (denylist-only). 옛 config
  의 `include = [...]` 은 silently 무시 + Config::load 가 단발
  deprecation warning emit.
- `IngestItem.warnings` 에 skip 사유 채움 (`unsupported media type:
  .docx` / `kb:// URI not yet supported`).
- `IngestReport.skipped_by_extension: BTreeMap<String, u32>` 신규
  (additive wire — release 트리거 안 됨 per CLAUDE.md). key =
  lowercase ext (`docx`, `txt`), no-ext = `<no-ext>` sentinel.
- CLI / TUI summary 에 breakdown 표시 (`90 skipped: 80 docx, 5 txt,
  5 epub`) — 모두, desc 정렬.
- README + `kebab init` config.toml 주석에 지원 형식 명시.

Spec status `planned`. 다음 단계: writing-plans skill 로 implementation
plan 작성.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:21:42 +00:00
a68c47124f Merge pull request 'chore(CLAUDE.md): release / binary version bump 규약 추가' (#100) from chore/claude-md-release-rule into main
Reviewed-on: #100
2026-05-05 02:44:15 +00:00
a98767088f review(claude-md-release-rule): 회차 1 nit 반영 — 헷갈리게 typo + wire schema minor/major 구분
- typo `햇갈리게` → `헷갈리게` (표준 한글 표기).
- wire schema 트리거에 additive (minor) vs breaking (v1→v2 major)
  구분 명시: additive 만으로는 release 트리거 안 됨, 진정한
  breaking 만 트리거. additive 예시로 IngestReport.unchanged 인용.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 02:43:46 +00:00
c6a67555d8 chore(CLAUDE.md): add release / binary version bump rule
도그푸딩 / 실사용 binary 가 필요할 때 release 버전을 bump 하는 규약을
명시. 이전엔 first-release 시점만 ad-hoc 결정했지만 후속 cycle 에서
일관성 위해 트리거 / 절차 / pre-1.0 minor vs patch 기준 정리.

트리거:
- 사용자가 새 바이너리로 도그푸딩 / 실사용을 명시.
- breaking schema change (V00X migration / wire v1→v2) 머지 후.
- frozen design contract 변경 머지 후.

절차: `gitea-release v<X.Y.Z>` + 도그푸딩-영향 surface 변경 중심
release notes. pre-1.0 minor = wire additive + surface 변경 누적,
patch = bug fix only.

bump 시점 = release 시점 같은 commit (tag 가 정확히 그 Cargo.toml
version snapshot 가리키도록). v0.1.0 의 bump-없이-tag 패턴은
후속 release 의 대상 commit 혼동 위험.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 02:42:23 +00:00
d417f843f8 Merge pull request 'chore: bump workspace version 0.1.0 → 0.2.0' (#99) from chore/bump-v0.2.0 into main
Reviewed-on: #99
2026-05-05 02:36:45 +00:00
ce68885d92 chore: bump workspace version 0.1.0 → 0.2.0 for next dev cycle
v0.1.0 release tagged on 2026-05-05 — first official release covering
P0~P4 + P5 + P6~P7 + P9 (UI) 의 도그푸딩 사이클 1회 완성.

이후 도그푸딩 follow-up (p9-fb-21 ~ p9-fb-24) 까지 v0.1.0 에 포함:
- p9-fb-22: TUI input cursor mid-string editing + Ask follow-tail.
- p9-fb-23: incremental ingest (skip unchanged docs).
- p9-fb-24: TUI status/key bar + Library 컬럼 헤더 + PgUp/PgDn.

다음 dev cycle 부터 0.2.0. minor bump rationale (semver pre-1.0):

- Wire schema additive (`IngestReport.unchanged`, `IngestEvent` 의
  `Unchanged` variant) — backward-compat 유지 but 신규 surface.
- 신규 CLI flag (`--force-reingest`).
- 신규 SQLite migration V006 (incremental ingest 컬럼).
- TUI surface 변경 (status bar 항상 노출, Library 헤더 row, PgUp/PgDn,
  cursor 편집).
- IngestOpts struct 도입 (AskOpts 패턴) — 기존 wrapper 보존하므로
  caller breaking 은 없음.

기존 ~723 워크스페이스 테스트 무수정 통과 (CARGO_PKG_VERSION 가 env!
로 동적 — TUI status bar 테스트 자동 추적).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 02:28:20 +00:00
268 changed files with 54944 additions and 534 deletions

2
.gitignore vendored
View File

@@ -1,6 +1,6 @@
.superpowers/
.worktrees/
.claude/
/target/
/target
**/*.rs.bk
Cargo.lock.bak

View File

@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
## Project
Single-user local-first knowledge base + RAG. Rust 2024 workspace, ~20 crates, single binary (`kebab`). All inference is local (Ollama + fastembed + whisper.cpp).
Single-user local-first knowledge base + RAG. Rust 2024 workspace, ~21 crates, single binary (`kebab`). All inference is local (Ollama + fastembed + whisper.cpp).
The repo's documentation is split by audience — don't duplicate across them:
@@ -27,7 +27,7 @@ cargo build --release # produces target/release/kebab
`-j 1` for the full workspace test isn't optional: 18 integration-test binaries each link `lance` + `datafusion` + `arrow` + `tantivy` and the parallel link step exhausts memory (linker gets SIGKILL'd, build silently fails partway). Per-crate runs are fine in parallel.
`target/` is 610 GB after a fresh build (DataFusion + Lance + fastembed + 18 × test-binary debug info). The dev/test profile is already trimmed (`debug = "line-tables-only"`, `split-debuginfo = "unpacked"` — see workspace `Cargo.toml`). Run `cargo clean` after phase merges if disk pressure shows up; backtraces still resolve to function + line.
`target/` is 610 GB after a fresh build but **balloons to 90+ GB after a few task cycles** (each fb-* batch adds incremental compile artifacts on top of the existing 18 × test-binary debug info). The dev/test profile is already trimmed (`debug = "line-tables-only"`, `split-debuginfo = "unpacked"` — see workspace `Cargo.toml`). Run `cargo clean` **routinely after each merged PR**, not just "if pressure shows up" — disk space is tight and recovery via `cargo clean` is cheap (one re-link per crate on next build). Verified pattern: 92 GB → 0 GB in seconds, backtraces still resolve to function + line.
## The facade rule
@@ -54,18 +54,38 @@ Each task spec lists `Allowed dependencies` and `Forbidden dependencies` per des
- `kebab-core` MUST NOT depend on any other `kebab-*` crate. Domain types only.
- `kebab-eval`'s `metrics` and `compare` modules MUST NOT import retrieval / embedding / LLM crates directly. The runner is allowed to use `kebab-app`'s facade (P5-1 inheritance — see deviations in that task spec).
- UI crates (`kebab-cli`, future `kebab-tui`, `kebab-desktop`) MUST NOT import `kebab-store-*` / `kebab-llm-*` / `kebab-parse-*` directly — only `kebab-app`.
- UI crates (`kebab-cli`, `kebab-mcp`, `kebab-tui`, future `kebab-desktop`) MUST NOT import `kebab-store-*` / `kebab-llm-*` / `kebab-parse-*` directly — only `kebab-app`.
Read the relevant task spec's deps section before adding an import. New crates inherit the same boundary rules.
## Wire schema v1
All `--json` output carries a `schema_version` field (`ingest_report.v1`, `search_hit.v1`, `answer.v1`, `doctor.v1`, …). Schemas live in `docs/wire-schema/v1/`. The wire shape is the contract for external integrations (Claude Code skills, MCP, etc.); breaking it requires a `*.v2` major bump and parallel-running both for one phase.
All `--json` output carries a `schema_version` field. Current schemas: `ingest_report.v1`, `ingest_progress.v1`, `search_hit.v1`, `answer.v1`, `doctor.v1`, `reset_report.v1`, `schema.v1`, `error.v1`, `chunk_inspection.v1`, `citation.v1`, `doc_summary.v1`. Schemas live in `docs/wire-schema/v1/`. The wire shape is the contract for external integrations (Claude Code skills, MCP, etc.); breaking it requires a `*.v2` major bump and parallel-running both for one phase. In `--json` mode, fatal errors emit `error.v1` to stderr as ndjson (non-`--json` mode keeps plain stderr text); exit codes 0/1/2/3 are unchanged — `error.v1.code` provides fine-grained agent branching.
In-tree integration packages live under `integrations/<host>/` — currently `integrations/claude-code/kebab/` (a Claude Code skill that calls `kebab search --json` / `kebab ask --json`). Any wire schema major bump (v1→v2) MUST update each shipped integration in the same PR, same as the version-cascade rule below. Per-user trigger keywords (team / system / acronym) belong in the user's local copy of the skill, not in the repo-shipped frontmatter — keep `integrations/claude-code/kebab/SKILL.md`'s `description` generic.
## Versioning cascade
`parser_version` / `chunker_version` / `embedding_version` / `prompt_template_version` / `index_version` follow the cascade rule in design §9. Changing any of these invalidates downstream records (chunks, embeddings, eval runs, …). When changing a version: either ship a re-process job or treat it as a breaking schema bump. The eval runner snapshots all five into `eval_runs.config_snapshot_json`.
## Release / binary version bump
Workspace `Cargo.toml``version` 은 binary release 의 정체성. 다음 트리거 중 하나 발생 시 **bump + 새 release 컷**:
- 사용자가 새 바이너리로 **도그푸딩** 또는 **실사용** 을 할 필요가 있다고 명시.
- breaking schema change (V00X migration / wire schema major bump v1→v2 등) 가 머지된 후 — 이전 릴리즈 binary 가 새 DB / 새 wire 와 호환 안 됨. wire 의 additive minor 변경 (예: `IngestReport.unchanged` 같은 필드 추가) 은 backward-compat 이라 본 트리거에 해당 안 됨.
- frozen design contract 변경 (design §X 갱신) 이 머지된 후.
Bump 자체는 단순 minor / patch 한 줄 수정 (`Cargo.toml` workspace `version`) — 이미 모든 kebab-* crate 가 `version = { workspace = true }` 라 자동 cascade. 동시에 `Cargo.lock` 자동 갱신.
Release 절차:
1. `gitea-release v<X.Y.Z>` (gitea-ops skill) 으로 tag + push + release notes.
2. release notes 는 사용자 도그푸딩에 영향 가는 surface 변경 위주 — wire schema 추가, CLI flag 신규, TUI 키 변경, V00X migration 등.
3. 프리-1.0 (`0.x.y`) 단계: minor bump 시 wire schema additive / surface 변경 누적, patch bump 시 bug fix only.
**bump 시점 = release 시점 같은 commit**. 즉 commit `chore: bump version 0.x → 0.y` 직후 같은 commit 에 tag. v0.1.0 (`2319206`) 처럼 bump 없이 tag 만 찍는 패턴은 후속 release 가 대상 commit 을 헷갈리게 함 — pre-release snapshot 은 SHA reference 로 충분.
## Naming + paths
- Crate prefix: `kebab-` (kebab-case package, `kebab_` snake_case in Rust modules).
@@ -74,6 +94,7 @@ All `--json` output carries a `schema_version` field (`ingest_report.v1`, `searc
- XDG paths: `~/.config/kebab/`, `~/.local/share/kebab/`, `~/.cache/kebab/`, `~/.local/state/kebab/`.
- SQLite filename: `kebab.sqlite` (under `data_dir`).
- Workspace ignore: `.kebabignore` (per directory).
- `_external/` (under `workspace.root`): single-file / stdin ingest 가 외부 file 을 deterministic 명명 (`<blake3-12>.<ext>`) 으로 copy. 첫 생성 시 `.kebabignore` 자동 append.
The migration from the old `kb` name lives in commits `911fb49 / f1a448d / f9714aa`. If you spot a leftover `kb` reference, treat it as a leftover and fix it (the rename PR sweep covered crates/, docs/, tasks/, README, design doc, fixtures — but workspace root `Cargo.toml` comments needed a follow-up; assume similar misses are possible).

908
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -22,6 +22,8 @@ members = [
"crates/kebab-parse-image",
"crates/kebab-parse-pdf",
"crates/kebab-tui",
"crates/kebab-mcp",
"crates/kebab-parse-code",
]
[workspace.package]
@@ -29,7 +31,7 @@ edition = "2024"
rust-version = "1.85"
license = "MIT OR Apache-2.0"
repository = "https://github.com/altair823/kebab"
version = "0.1.0"
version = "0.8.2"
[workspace.dependencies]
anyhow = "1"
@@ -71,10 +73,27 @@ futures = "0.3"
# pass; pulled into the workspace deps so future crates can share the
# same major.
regex = "1"
# MCP (Model Context Protocol) SDK. server + macros + transport-io provide
# stdio JSON-RPC transport for `kebab-mcp` (p9-fb-30). schemars feature
# exposes the derive macro used by tool input schemas.
rmcp = { version = "1.6", default-features = false, features = ["server", "macros", "transport-io", "schemars"] }
# Dev-only HTTP mock server for kebab-llm-local Ollama adapter tests. Requires
# a tokio runtime to host its mock server (the runtime adapter crate stays
# sync via reqwest::blocking — wiremock is dev-only there).
wiremock = "0.6"
base64 = "0.22"
# Pure-Rust git library for repo metadata detection (kebab-parse-code).
# No `git` binary required. Default features include thread-safety + most
# object-reading capabilities needed for HEAD name + commit SHA queries.
gix = { version = "0.70", default-features = false, features = ["revision"] }
# Rust source parsing for code ingest (kebab-parse-code, p10-1A-2). The
# chunker stays tree-sitter-free — AST work is parser-side per design §6.3.
tree-sitter = "0.26"
tree-sitter-rust = "0.24"
# Python / TS / JS grammars for code ingest (kebab-parse-code, p10-1B).
tree-sitter-python = "0.25.0"
tree-sitter-typescript = "0.23.2"
tree-sitter-javascript = "0.25.0"
# Disk-footprint trim for dev / test builds. Codegen, opt-level, and
# behavior are unchanged — only DWARF debug info is reduced (line

View File

@@ -20,6 +20,7 @@ P0P5 + P6 + P7 + P9-1/2/3/4 (Library / Search / Ask / Inspect) 머지 완료.
| **P7** | PDF text + page citation | `kebab-parse-pdf` | P5 | ✅ 완료 (3/3 component, page-level chunker + ingest wiring) |
| **P8** | 음성 transcription + timestamp citation | `kebab-parse-audio` | P5 | ⏸ 보류 (whisper-rs 시스템 dep brainstorm 필요) |
| **P9** | TUI + desktop app | `kebab-tui`, `kebab-desktop` | P5 | 🟡 진행 (4/5 component — P9-1/2/3/4 완료 [Library / Search / Ask / Inspect], P9-5 desktop 예정 · 도그푸딩 피드백 **20/20 ✅**) |
| **P10** | code ingest framework | `kebab-parse-code` | P5 | 🟡 진행 중 — 1A-1 ✅ (wire schema + parse-code skeleton + filter flags), 1A-2 ✅ (Rust AST chunker, tree-sitter-rust, `code-rust-ast-v1` — v0.7.0), **1B 🟡 PR 오픈** (Python `code-python-ast-v1` + TypeScript `code-ts-ast-v1` + JavaScript `code-js-ast-v1` — 3 언어 dogfooding 가능, v0.8.0 대기) |
P0~P5 직렬. P6~P9 P5 이후 병렬 가능.
@@ -31,6 +32,14 @@ P0~P5 직렬. P6~P9 P5 이후 병렬 가능.
머지 후 발견된 모든 deviation / hotfix 의 dated 로그는 [tasks/HOTFIXES.md](tasks/HOTFIXES.md). 본 요약은 \"누군가가 인수받을 때 알아두면 시간을 많이 절약하는\" 항목만:
- **2026-05-20 P10-1B (Rust 1A symbol path 비일관 + expression-level 함수 미방출)** — (a) Rust `code-rust-ast-v1` 은 file-scope nesting 만 (workspace path prefix 없음), 1B 의 Python/TypeScript/JavaScript 는 workspace 경로 → module path prefix 사용 (비일관 수용, retrofit = chunker_version bump + reindex 필요, 사용자 명시 요청까지 보류); (b) TS/JS 의 `const foo = () => {...}` 같은 expression-level 함수는 `<top-level>` glue 로 처리됨 (declaration-level 단위만 1B 1차 범위). 자세한 내용: `tasks/HOTFIXES.md` (2026-05-20) 두 항목.
- **2026-05-19 P10-1A-2 (code_rust_ast_v1.rs + SourceType)** — `AST_CHUNK_MAX_LINES` 상수가 `IngestCodeCfg.ast_chunk_max_lines` 를 읽지 않고 모듈 상수 200 고정 (Chunker trait 이 per-medium config 미노출); `SourceType::Code` variant 부재로 code 파일이 `SourceType::Note` 로 분류됨 — 두 항목 모두 `tasks/HOTFIXES.md` (2026-05-19) 에 기록.
- **2026-05-07 fb-26 (progress.rs)** — `Aborted` unconditional writeln (TTY duplicate) + `Completed` TTY no summary fixed; `KEBAB_PROGRESS=plain` env + quiet suppression added
- **2026-05-07 fb-28 (main.rs)** — `--readonly` (KEBAB_READONLY) blocks Ingest/IngestFile/IngestStdin/Reset; `--quiet` suppresses progress stderr; error.v1 code: "readonly_mode"
- **2026-05-07 macOS XDG path collision (config 사라지는 버그)** — `dirs` crate 가 macOS 에서 `config_dir()``data_dir()` 둘 다 `~/Library/Application Support/` 반환 → `reset --data-only` 가 config 파일까지 삭제. Fix: `~/.config`, `~/.local/share`, `~/.cache` 직접 사용. 새 경로: config `~/.config/kebab/`, data `~/.local/share/kebab/`, cache `~/.cache/kebab/`. `Config::load(None)` 이 macOS legacy path 에서 자동 마이그레이션. 자세한 내용: `tasks/HOTFIXES.md`.
- **2026-05-07 P9 post-도그푸딩 (p9-fb-31)** — `kebab ingest-file <path>` + `kebab ingest-stdin --title <T>` 두 신규 subcommand + MCP tool `ingest_file` / `ingest_stdin` (4 → 6 tool). agent 가 fetch 한 web markdown / 외부 file 을 KB 에 즉시 저장. workspace 외부 file 은 `<workspace.root>/_external/<blake3-12>.<ext>` 로 copy (deterministic 명명 → idempotent). `_external/` 디렉토리 첫 생성 시 `.kebabignore` 자동 append (walk 무한 루프 방지). stdin 은 markdown 전용 + flag (`--title`, `--source-uri`) → frontmatter 자동 prepend. .kebabignore 매치 시 stderr warn 후 진행 (explicit ingest = bypass intent). fb-30 의 v1 read-only MCP 정책 변경 — 첫 mutation tool 도입. spec: `tasks/p9/p9-fb-31-single-file-stdin-ingest.md`. design: `docs/superpowers/specs/2026-05-07-p9-fb-31-single-file-stdin-ingest-design.md`.
- **2026-05-07 P9 post-도그푸딩 (p9-fb-30)** — `kebab mcp` 신규 subcommand + new crate `kebab-mcp` (lib only) — stdio JSON-RPC server. 4 read-only tool (`search` / `ask` / `schema` / `doctor`) 가 `kebab-app` facade 위에 build. rmcp 1.6 SDK 채택, manual `tools/list` + `tools/call` dispatch (rmcp 의 `#[tool_router]` 매크로 대신). `error_classify` 모듈을 `kebab-cli``kebab-app::error_wire` 로 promotion (UI crate 끼리 import 회피, facade 룰 준수). `ErrorV1``schema_version: String` 필드 추가 — kebab-mcp 의 직접 serialize 경로에서도 wire 정합. `KebabAppState``(Config, Option<PathBuf>)` carry — doctor tool 의 path-aware behavior 위해. ask + search arm 의 `tokio::task::spawn_blocking` wrap — `OllamaLanguageModel` 의 reqwest blocking client 가 async 안에서 panic 회피. capability flag `mcp_server` `false``true`. agent integration MVP 완성 — Claude Code / Cursor / OpenAI Agents 등 host-agnostic 사용 가능. spec: `tasks/p9/p9-fb-30-mcp-server.md`. design: `docs/superpowers/specs/2026-05-07-p9-fb-30-mcp-server-design.md`.
- **P3-5 / P4-3 `--config` 누락** — `kebab-cli``--config <path>` 를 honor 하려면 `kebab_app::*_with_config` companion 을 호출해야 함. 두 번 같은 모양으로 회귀했음.
- **P6-2 OCR 기본 엔진** — spec literal 의 Tesseract 가 시스템 dep 부담으로 거부됨, Ollama vision LM 으로 대체. `OcrEngine` trait 그대로라 future swap 가능.
- **P6-3 caption** — `GenerateRequest.images` 필드를 `kebab-core::LanguageModel` trait 에 신설. 기존 caller 모두 `images: Vec::new()` 로 마이그레이션.
@@ -60,10 +69,12 @@ P0~P5 직렬. P6~P9 P5 이후 병렬 가능.
- **2026-05-03 P9 도그푸딩 후속 (p9-fb-12 follow-up)** — heuristic 제거 (partial PR 의 deferred 부분 finalize). `search::is_typing_mod` (CTRL/ALT chord filter) 함수 삭제 + `ask::handle_key_ask` 의 input-empty heuristic 삭제. 새 dispatch: `search::handle_key_search``i` (chunk inspect) / `g` (editor jump) pre-pass 가 `state.mode == Mode::Normal` 일 때만 fire (Insert 에서는 typed char). main match 의 `j`/`k`/Char(c) 가 `state.mode` 로 분기 (Normal → 선택 이동, Insert → input.push). `ask::handle_key_ask``e`/`j`/`k` 도 동일 패턴 — Normal 에서 toggle/scroll, Insert 에서 input typing. 테스트 fixture (`tests/search.rs::fresh_app`, `tests/ask.rs::fresh_app`) 가 `app.mode = Mode::auto_for(focus)` 로 run-loop 동작 mirror. 기존 nav 테스트 (j_k_move, g_key_enqueues, e_toggles) 는 explicit `app.mode = Mode::Normal` 추가, 신규 4 테스트 (j_in_insert_types / arbitrary_char_in_normal_noop / e_types_in_insert / jk_scroll-in-normal-type-in-insert) 가 mode-authoritative 동작 pin. spec status `in_progress``completed`. spec: `tasks/p9/p9-fb-12-tui-mode-machine.md`.
- **2026-05-03 P9 도그푸딩 후속 (p9-fb-10 partial)** — TUI CJK rendering helpers. `kebab-tui::input::{display_width, truncate_to_display_width}` 신규 — `unicode-width` 위에서 column-단위 width 계산 (ASCII=1, Hangul/CJK/fullwidth=2, combining=0) + char-boundary 안전 truncate (wide char 를 split 없이 keep-or-omit, ellipsis 1 col). library.rs 의 중복 `truncate_to_display_width` private fn 제거 — 단일 source. 9 unit tests (ASCII / Hangul / Japanese / mixed / truncate fits·overflow·zero-cols·wide-char-boundary / `String::pop` char-aware sanity) + 1 integration render test (Korean + Japanese fixture, TestBackend 80×20, 한글/일본어 글자가 frame 에 살아남음 확인). spec 의 `InputBuffer` struct (cursor 가 column 단위 wide-char width 추적) 도입은 follow-up — Ask/Search/Editor pane 의 String + cursor 일괄 마이그레이션이 회귀 표면이 커서 helper 만 먼저 머지. backspace 는 모든 pane 이 이미 `String::pop()` 사용 (char-aware) → byte-boundary 안전성 helper 없이도 확보. crossterm 0.28 이 native IME composing 미노출 — preedit handling out of scope. spec status `planned``in_progress`. spec: `tasks/p9/p9-fb-10-tui-cjk-input.md`.
- **2026-05-04 P9 post-도그푸딩 (p9-fb-23)** — Incremental ingest. 사용자 도그푸딩 피드백: 변하지 않은 문서는 다시 ingest 하지 않기. blake3 checksum + parser_version + chunker_version + embedding_version 4개 input 이 모두 일치할 때 parse/chunk/embed/vector upsert 모두 회피. SQLite V006 마이그레이션 — `documents``last_chunker_version` + `last_embedding_version` 컬럼 추가. 신규 `IngestItemKind::Unchanged` variant + `IngestReport.unchanged` + `AggregateCounts.unchanged` (wire schema additive). `IngestOpts { progress, cancel, force_reingest }` struct 도입 — `AskOpts` 패턴. `--force-reingest` CLI flag 로 skip 우회. 비용 dominator (fastembed) 가 변경된 / 새 doc 에만 발생. spec: `tasks/p9/p9-fb-23-incremental-ingest.md`. HOTFIXES `2026-05-04 — p9-fb-23` 항목이 version cascade 명시 동작의 source of truth.
- **2026-05-05 P9 post-도그푸딩 (p9-fb-25)** — Config 의 `workspace.include` 필드 제거 + 지원 형식 가시성. 사용자 도그푸딩 피드백: include + exclude 동시 존재가 case 4 (둘 다 매치 안 함) 의미 모호 + 어차피 처리 가능 형식 (md / png / jpg / pdf) 이 정해져 있으니 명시 필요. `WorkspaceCfg.include` 제거 (옛 config 의 `include = [...]` 은 silently 무시 + 단발 deprecation warning). `IngestItem.warnings` 가 Skipped 시 사유 (`"unsupported media type: .docx"` 등) 채움. `IngestReport.skipped_by_extension: BTreeMap<String, u32>` 신규 (additive wire — release 트리거 안 됨). CLI / TUI summary 에 breakdown 표시 (`"5 skipped: 3 docx, 1 txt, 1 epub"`). README + `kebab init` 헤더 주석에 지원 형식 명시. spec: `tasks/p9/p9-fb-25-config-include-removal.md`. HOTFIXES `2026-05-05 — p9-fb-25` 가 source of truth.
- **2026-05-04 P9 post-도그푸딩 (p9-fb-24)** — TUI status/key bar + Library 컬럼 헤더 + Ask/Inspect PgUp/PgDn. 사용자 도그푸딩 3 건 (Library 컬럼 의미 부재, 페이지 스크롤 키 부재, 상태바 + 버전 정보 항상 노출 요청) 을 단일 PR 로 통합. bottom 영역을 status bar (1 row, version + pane + docs + dynamic state) + key hint bar (1 row, 기존 `footer_hints` 그대로) 두 줄로 분할; 기존 ingest progress dedicated row 는 status bar 의 dynamic slot 에 흡수 (priority cascade: streaming → searching → indexing → idle). Library `List` 위에 `format_doc_header` 행 + Layout 분할로 헤더 표시 (TITLE / TAGS / UPDATED / CHUNKS, display-width 정렬). `kebab-tui::pager::PAGE_STEP = 10` 신규 — Ask 의 PgUp/PgDn 추가 + Inspect 의 기존 +/-10 hardcode 가 같은 상수 참조로 통일. Ask 의 page-scroll 은 `j`/`k` 와 동일하게 `follow_tail = false` 로 freeze. spec: `tasks/p9/p9-fb-24-tui-affordances.md`. HOTFIXES `2026-05-04 — p9-fb-24` 항목이 footer 단행 row (p9-fb-13) + ingest dedicated row (p9-fb-03) 와의 layout 충돌의 source of truth.
- **2026-05-04 P9 post-도그푸딩 (p9-fb-22)** — TUI 입력 cursor mid-string 편집 + Ask follow-tail auto-scroll. Gitea #94 (입력 후 커서 이동 안 됨) + #95 (새 응답 자동 스크롤 안 됨) 두 건. `InputBuffer` 의 cursor 모델을 byte-position 기반으로 재구성 — cursor 가 끝일 때 기존 append 동작과 backwards-compatible, mid-string 일 때는 `←/→/Home/End/Delete` 로 편집. `AskState``follow_tail: bool` (default true). `Paragraph::line_count(width)` (ratatui `unstable-rendered-line-info` feature 활성화) 로 매 프레임 wrapped row 수 계산해 follow-tail 시 scroll 을 bottom 에 pin. `j`/`k` 가 follow-tail 끄고 `Shift-G` 가 다시 켬. 12 신규 InputBuffer unit + 6 신규 Ask integration. spec: `tasks/p9/p9-fb-22-tui-cursor-and-autoscroll.md`. HOTFIXES 항목 `2026-05-04` 가 live cursor 모델 source of truth.
- **2026-05-03 P9 post-도그푸딩 (p9-fb-21)** — `i` 가 universal Normal→Insert toggle (모든 pane). 이전 mode_intercept 는 Library/Inspect/Jobs 만 `i` intercept 였고 Search/Ask 는 fall-through (자동 INSERT 가정). 사용자가 Esc 로 NORMAL 로 빠진 후 Insert 복귀 키 없어 dead-end → 도그푸딩에서 보고됨. mode_intercept 의 `(Char('i'), Normal, _)` arm 이 pane 무관 모두 INSERT flip. Search 의 chunk inspect 키 `i``o` rebind (vim "open") 으로 충돌 해소. footer hint 모든 (pane, mode, filter) 조합 첫 fragment = `F1 도움말` (cheatsheet binding discoverability). Search/Ask Normal hint 에 `i 입력모드` fragment 추가. cheatsheet popup Global/Search/Ask section 갱신. 6 신규 unit + 3 기존 갱신. spec: `tasks/p9/p9-fb-21-tui-insert-key-discoverability.md` (status `completed` 직접). HOTFIXES 항목이 Search `i``o` rebind 의 source of truth.
- **2026-05-03 P9 도그푸딩 후속 (p9-fb-10 follow-up)** — InputBuffer struct + 모든 text-input pane 마이그레이션 + cursor column 정렬. `kebab-tui::input::InputBuffer { content, cursor_col }` 신규 — `push_char` / `pop_char` / `clear` / `take` 가 wide-char 단위로 cursor_col 진행 (ASCII=1, Hangul/CJK=2, combining=0). `SearchState.input` / `AskState.input` / `FilterEdit.{tags_buf, lang_buf}` 가 InputBuffer 로 교체. render 단계에서 `f.set_cursor_position(...)``block.inner(area)` 기반 prompt 폭 + cursor_col 으로 caret 을 정확한 column 에 배치 (right-edge clamp). ratatui 0.28 의 cursor visibility 는 `cursor_position` Some/None 으로 자동 결정 — Search/Ask/Filter 가 `Some` 이라 caret 보임, Library/Inspect 는 `None` 이라 hidden. Korean lexical 검색은 `crates/kebab-app/tests/search_korean.rs` 에서 ingest → search → 결과 한 건 이상 + Korean 파일 stem 매칭 assert 로 회귀 핀. `lexical_query` test helper 가 `crates/kebab-app/tests/common/mod.rs` 로 promotion. spec status `in_progress``completed`. spec: `tasks/p9/p9-fb-10-tui-cjk-input.md`.
- **2026-05-07 P9 post-도그푸딩 (p9-fb-27)** — `kebab schema [--json]` introspection 명령 + `error.v1` wire 도입. 정적 (wire schemas / capabilities / models) + 동적 (stats) 한 번에. `--json` 모드에서 fatal error 가 stderr ndjson 으로 emit (비 `--json` 은 기존 stderr text 유지). exit code 0/1/2/3 unchanged — `error.v1.code` 가 fine-grained 분기. fb-30 MCP `initialize` capability matrix 의 prerequisite. spec: `tasks/p9/p9-fb-27-introspection-and-error-wire.md`. design: `docs/superpowers/specs/2026-05-07-p9-fb-27-introspection-and-error-wire-design.md`.
- **2026-05-03 P9 도그푸딩 피드백 20/20 ✅** — `tasks/p9/p9-fb-01..20` 모든 spec status `completed`. 사용자가 `kebab` 직접 돌려서 수집한 UX 잡음 (ingest 진행 표시 부재, mode 혼란, CJK column drift, multi-turn 부재, citation 부재 등) 이 모두 코드 또는 spec-acknowledged-deferred 형태로 해소. 도그푸딩 사이클 한 바퀴 완성 — P9-5 desktop tauri 와 별개로 TUI/CLI 사용자 경험 측면은 한 단계 안정화. P9 phase row 는 P9-5 미진행이라 🟡 유지.
- **2026-05-03 P9 도그푸딩 후속 (p9-fb-13 follow-up)** — verb-form hint line 재구성. `pub fn footer_hints(focus: Pane, mode: Mode, filter_open: bool) -> &'static str` 신규 (run.rs). 한국어 동사구 (`"위로"` / `"아래로"` / `"필터"` / `"타이핑 검색어"` / `"Esc 로 NORMAL 모드"` 등) + mode-aware (NORMAL = navigation verbs, INSERT = typing + Esc reminder) + Library filter overlay 별 분기. 8 unit tests pin 모든 (pane, mode, filter) 조합 — exhaustive non-empty + Library Normal/filter, Search Normal/Insert, Ask Normal/Insert, Inspect Normal 별 verb fragment 존재 검증. spec status `in_progress``completed` — p9-fb-13 partial 의 deferred verb-form 항목이 닫힘.
- **2026-05-03 P9 도그푸딩 후속 (p9-fb-13)** — TUI cheatsheet popup. `kebab-tui::cheatsheet::render_cheatsheet(f, area, app)` 신규 — 70%/60% centered modal, sections (Global / Library / Search / Ask / Inspect) + global toggle table + 현재 focused pane footer. `App.cheatsheet_visible: bool` 필드 + `pub fn cheatsheet_visible()` getter. run loop `cheatsheet_intercept(app, key)` 가 mode_intercept 보다 먼저 dispatch — `F1` 토글 (open/close), `Esc` 가 visible 일 때 닫기 (mode_intercept 를 우회해서 cheatsheet 닫기 가 mode flip 도 발동시키지 않도록), 그 외 키는 fall-through (popup 열린 채 navigation 가능). modifier-bearing F1 (Ctrl-F1 등) 은 무시. **HOTFIXES 기록**: spec 의 `?` trigger 가 Library 의 quick-Ask binding 과 충돌해서 `F1` 으로 rebind. spec 의 verb-form hint line 재구성은 별 후속 PR (기존 footer 가 동일 역할). spec status `planned``in_progress` (verb hint deferral 으로 partial). spec: `tasks/p9/p9-fb-13-tui-cheatsheet.md`.
@@ -78,6 +89,18 @@ P0~P5 직렬. P6~P9 P5 이후 병렬 가능.
P9-2/3/4 는 P9-1 의 parallel-safety contract (sub-state slot 패턴) 덕에 병렬 진행 가능 — 같은 `App` 손대지 않음.
### P9 dogfooding 백로그 (fb-26 ~ fb-42) — release 분할
2026-05-06 도그푸딩 누적 피드백 + "AI agent 가 kebab 을 쓰게 한다" 궁극 목표용 surface 확장. cascade 영향 / 분량 고려해 한 minor 에 묶지 않고 분할.
- **0.3.0 — agent foundation** ✅ cut 2026-05-07: fb-26 (log), fb-27 (introspection/error wire), fb-28 (readonly/quiet). ~~fb-29 (daemon)~~ → 🚫 **deferred** — fb-30 stdio MCP 가 동일 가치를 daemon 복잡도 없이 제공.
- **0.4.0 — agent integration (MCP)** ✅ cut: fb-30 (MCP stdio), fb-31 (single-file/stdin ingest).
- **0.5.0 — agent surface refinement (additive)** ✅ cut 2026-05-10: fb-32 (stale doc indicator), fb-33 (streaming ask), fb-34 (output budget controls), fb-35 (verbatim fetch), fb-36 (search filter args), fb-37 (trace + stats). 모두 wire schema additive minor.
- **0.6.0 — RAG quality** 🟡 진행: fb-38 (score semantics) ✅ 머지 (2026-05-10), fb-40 (fact-grounded answer / rag-v2 prompt) ✅ 머지 (2026-05-10), fb-39 (retrieval precision tuning, embedding_version cascade) — 미진행 (eval golden set 선행 필요).
- **0.7.0 또는 P+**: fb-41 (multi-hop reasoning, XL), fb-42 (bulk multi-query / rerank, Nice).
각 fb spec frontmatter 의 `target_version` 필드가 source of truth. INDEX.md 의 release subheader 도 동일 grouping.
## 검증된 운영 동작 (release binary, fastembed enabled)
P7-3 머지 직후 25 시나리오 smoke 통과 — markdown + image + PDF 5 자산 워크스페이스에서 doctor / ingest / list / inspect / search (lex/vec/hybrid) / re-ingest / byte-edit re-ingest / corrupt PDF / RAG ask + page citation 모두. 자세한 시나리오 표는 conversation 기록 참조; 워크스페이스에 직접 돌려보는 절차는 [docs/SMOKE.md](docs/SMOKE.md).

View File

@@ -7,7 +7,7 @@
- **Rust toolchain** ≥ 1.85 (workspace 가 edition 2024 + resolver 3 사용). [rustup](https://rustup.rs) 권장.
- **Ollama** — `kebab ask` 와 이미지 OCR/caption 가 사용. `https://ollama.com/download` 에서 설치 후 `ollama serve` 실행. 기본 LLM 은 gemma4 계열 (`ollama pull gemma4:e4b`) — OCR / caption 도 같은 family 라 모델 하나만 pull 하면 됨. 더 큰 variant 원하면 `gemma4:26b` 등으로 config override. config 의 `[models.llm].endpoint` 에 host:port 명시.
- **빌드 디스크** — 첫 빌드 시 `target/` 가 610 GB (Lance + DataFusion + fastembed). 여유 확인.
- **fastembed 모델** — 첫 `kebab ingest``multilingual-e5-small` (~470 MB) 자동 다운로드.
- **fastembed 모델** — 첫 `kebab ingest``multilingual-e5-large` (~1.3 GB, fb-39b) 자동 다운로드. `config.toml` 에서 `model = "multilingual-e5-small"` 로 명시하면 이전 모델 사용.
## 설치
@@ -42,7 +42,7 @@ cargo install --git https://gitea.altair823.xyz/altair823-org/kebab.git --bin ke
# 첫 실행 — XDG 경로에 데이터 디렉토리 + config.toml 생성
kebab init
# config 손보고 — `[workspace] include` 에 *.md / *.png / *.pdf 등 추가, 모델 endpoint 등
# config 손보고 — workspace.root, 모델 endpoint 등 설정 (지원 형식: md / png / jpg / pdf / rs / py / ts / js)
${EDITOR:-vi} ~/.config/kebab/config.toml
# 색인 (Markdown / 이미지 / PDF 모두 한 번에)
@@ -70,17 +70,50 @@ kebab doctor
| 명령 | 동작 |
|------|------|
| `kebab init` | XDG 경로에 데이터 디렉토리 + config.toml 생성 |
| `kebab ingest [<path>]` | Markdown / 이미지 / PDF 색인 (idempotent). TTY 에서는 stderr 진행 바, non-TTY (CI / pipe) 는 stderr 한 줄씩, `--json` 은 stdout 에 `ingest_progress.v1` 라인 streaming 후 마지막에 `ingest_report.v1`. Ctrl-C 한 번이면 현재 asset 마무리 후 abort (부분 commit 보존, idempotent re-run), 두 번째 Ctrl-C 는 hard exit. Markdown title 이 frontmatter 에 없어도 첫 H1 → H2 → 첫 paragraph 80 자 → 파일명 순으로 자동 채움 (parser_version `md-frontmatter-v2`) — 기존 색인된 doc 도 다음 ingest 에서 새 title 로 갱신. **Incremental** (p9-fb-23): 두 번째 이후의 ingest 는 변하지 않은 doc (blake3 + parser/chunker/embedder version 모두 동일) 의 parse/chunk/embed/vector upsert 를 자동 스킵. final summary 에 `N unchanged` 카운트 표시. `--force-reingest` 로 skip 무시 강제 재처리. |
| `kebab search --mode {lexical,vector,hybrid} "<query>" [--no-cache]` | 검색. hybrid는 RRF fusion, citation 포함. 같은 process 안에서 동일 query (NFKC + trim + lowercase 정규화) 반복 시 in-process LRU 캐시 hit (capacity = `[search] cache_capacity`, default 256). `--no-cache` 로 강제 bypass — 디버깅용. ingest commit 발생 시 `kv['corpus_revision']` bump 으로 모든 entry 자동 stale |
| `kebab ingest [<path>]` | Markdown / 이미지 / PDF / Rust 소스코드 색인 (idempotent). TTY 에서는 stderr 진행 바, non-TTY (CI / pipe) 는 stderr 한 줄씩, `--json` 은 stdout 에 `ingest_progress.v1` 라인 streaming 후 마지막에 `ingest_report.v1`. Ctrl-C 한 번이면 현재 asset 마무리 후 abort (부분 commit 보존, idempotent re-run), 두 번째 Ctrl-C 는 hard exit. Markdown title 이 frontmatter 에 없어도 첫 H1 → H2 → 첫 paragraph 80 자 → 파일명 순으로 자동 채움 (parser_version `md-frontmatter-v2`) — 기존 색인된 doc 도 다음 ingest 에서 새 title 로 갱신. **Incremental** (p9-fb-23): 두 번째 이후의 ingest 는 변하지 않은 doc (blake3 + parser/chunker/embedder version 모두 동일) 의 parse/chunk/embed/vector upsert 를 자동 스킵. final summary 에 `N unchanged` 카운트 표시. `--force-reingest` 로 skip 무시 강제 재처리. **지원 형식** (extractor 자동 결정 — config 에 명시 불가): Markdown (`.md`), 이미지 (`.png` / `.jpg` / `.jpeg`, OCR + caption), PDF (`.pdf`), **소스코드** (`.rs``code-rust-ast-v1`, `.py``code-python-ast-v1`, `.ts`/`.tsx``code-ts-ast-v1`, `.js`/`.mjs`/`.cjs`/`.jsx``code-js-ast-v1` — 모두 tree-sitter AST chunker). 다른 확장자는 자동 skip — `IngestItem.warnings` 에 사유 (`"unsupported media type: .docx"` 등), `IngestReport.skipped_by_extension` 에 카운트 분류, CLI / TUI summary 에 breakdown 표시. 코드 chunk 는 `citation.kind = "code"``citation.lang = "<lang>"` + `symbol` + line range 를 담고, SearchHit top-level 에 `code_lang` + `repo` (`.git/` walk-up 의 디렉토리 이름) 가 backfill 됨. `--code-lang rust` / `--code-lang python` / `--code-lang typescript` / `--code-lang javascript` / `--media code` filter 로 언어별·코드 전용 검색 가능 (p10-1A-1 filter flags). Python symbol 은 workspace 경로 → dotted module path prefix (예: `kebab_eval.metrics.compute_mrr`), TS/JS symbol 은 slash-style module path prefix (예: `src/Foo.Foo.search`). |
| `kebab search --mode {lexical,vector,hybrid} "<query>" [--no-cache] [--max-tokens N] [--snippet-chars N] [--cursor <opaque>] [--tag T] [--lang L] [--path-glob G] [--trust-min LEVEL] [--media TYPE] [--ingested-after RFC3339] [--doc-id ID] [--trace] [--bulk] [--repo NAME ...] [--code-lang LIST]` | 검색. hybrid는 RRF fusion, citation 포함. 같은 process 안에서 동일 query (NFKC + trim + lowercase 정규화) 반복 시 in-process LRU 캐시 hit (capacity = `[search] cache_capacity`, default 256). `--no-cache` 로 강제 bypass — 디버깅용. ingest commit 발생 시 `kv['corpus_revision']` bump 으로 모든 entry 자동 stale. **`--max-tokens` / `--snippet-chars` / `--cursor` (p9-fb-34)** — agent budget controls. `--json` 출력은 `search_response.v1` wrapper (`{hits, next_cursor, truncated}`) — pre-fb-34 의 bare array 와 호환 안 됨. mismatched cursor → `error.v1.code = stale_cursor`. **filter flags (p9-fb-36):** `--tag` 는 반복 가능 flag (`--tag rust --tag async`) 로 OR 매칭, `--media``,` 구분 다중 값 OR 매칭, 나머지 flags 간은 AND 조합. `--trust-min``primary\|secondary\|generated` 중 하나 (해당 level 이상 포함). `--ingested-after` 는 RFC3339 UTC — 파싱 실패 시 `error.v1.code = config_invalid` (exit 2). `--media md``markdown` alias 로 정규화. 알 수 없는 `--media` 값은 무조건 empty hits (오류 아님). **`--trace` (p9-fb-37)** — `search_response.v1.trace` 에 lexical / vector pre-fusion 후보 + RRF union + per-stage timing (`lexical_ms` / `vector_ms` / `fusion_ms` / `total_ms`) 노출. trace 요청은 캐시 우회 (`--no-cache` 없이도 항상 cold). **`--bulk` (p9-fb-42)** — stdin ndjson 으로 N query 한 번에 실행. `--json` 면 stdout per-query ndjson (`bulk_search_item.v1`) + stderr summary (`bulk_summary: total=N succeeded=S failed=F`). Cap 100. agent 가 query decomposition 후 sub-query 일괄 실행 시 single round-trip — App instance 재사용으로 캐시 / embedder cold-start 비용 한 번만. Per-query failure 는 item 의 `error` (error.v1) 에 격리, 다른 query 계속 진행. **code corpus filters (p10-1A-1):** `--repo` 는 반복 가능 (`--repo kebab --repo other`) OR 매칭. `--code-lang` 는 반복 또는 comma 다중 값 (`--code-lang rust,python`), 알 수 없는 값은 빈 hits. `--media code` 는 Tier 1/2/3 모든 code chunk 포함. 1A-1 시점에서는 indexed 된 code chunk 가 없어 filter 가 항상 빈 결과 — 1A-2 (Rust AST chunker) 머지 이후 실효. |
| `kebab list docs` | 색인된 문서 목록 |
| `kebab inspect doc <id>` / `kebab inspect chunk <id>` | raw record 보기 |
| `kebab ask "<query>" [--show-citations / --hide-citations] [--session <id>]` | RAG 답변 + 근거 인용. 답변 후 `근거:` block 으로 full path / line range / score 한 줄씩 (default ON — `--hide-citations` 로 끄기, pipe 시 유용). 근거 부족 시 거절. Ollama 필요. `--session <id>` 로 multi-turn — 첫 호출에서 SQLite `chat_sessions` 에 자동 생성, 이후 호출은 prior turns 를 history 로 받아 follow-up. session id 는 사용자 지정 (e.g. `kb-rust-async-2026-05`) — `kebab reset --data-only` 로 모든 session wipe |
| `kebab fetch chunk <id> [--context N]` / `kebab fetch doc <id> [--max-tokens N]` / `kebab fetch span <doc_id> <ls> <le> [--max-tokens N]` | (p9-fb-35) verbatim text fetch from indexed corpus. wire = `fetch_result.v1` (kind discriminator). chunk: target + ±N ordinal-context chunks. doc: full normalized markdown. span: 1-based line range (PDF/audio rejected as `error.v1.code = span_not_supported`). chars/4 budget on doc/span. |
| `kebab ask "<query>" [--show-citations / --hide-citations] [--session <id>] [--stream]` | RAG 답변 + 근거 인용. 답변 후 `근거:` block 으로 full path / line range / score 한 줄씩 (default ON — `--hide-citations` 로 끄기, pipe 시 유용). 근거 부족 시 거절. Ollama 필요. `--session <id>` 로 multi-turn — 첫 호출에서 SQLite `chat_sessions` 에 자동 생성, 이후 호출은 prior turns 를 history 로 받아 follow-up. session id 는 사용자 지정 (e.g. `kb-rust-async-2026-05`) — `kebab reset --data-only` 로 모든 session wipe. **`--stream` (p9-fb-33)** 로 ndjson `answer_event.v1` event (retrieval_done → token* → final) 를 stderr 에 흘리고 stdout 마지막 줄에 기존 `answer.v1` — agent 가 token 즉시 소비 가능 |
| `kebab doctor` | 설정/모델/DB 헬스 체크 |
| `kebab tui` | Ratatui 셸 (Library + Search + Ask + Inspect 패널, desktop 진행 중). Library 에서 `r` 키로 background ingest 시작 — 화면 하단 status bar 가 진행 표시, 완료/abort 시 final 라인 잠시 유지 후 자동 hide. ingest 진행 중 `Esc` / `Ctrl-C` 가 cancel signal (그 외에는 quit). vim-style mode (header 우측 `-- NORMAL --` / `-- INSERT --`) — Library/Inspect 는 자동 NORMAL, Search/Ask 는 자동 INSERT. `i` 로 Normal→Insert (모든 pane — p9-fb-21), `Esc` 로 Insert→Normal 어디서나. mode-authoritative dispatch — Search 의 `j/k/o/g`, Ask 의 `e/j/k` 는 NORMAL 모드에서만 명령으로 동작, INSERT 에서는 입력 문자로 typing. (Search 의 chunk inspect 키는 `i``o` 로 rebind — `i` 가 universal Insert toggle.) **`F1` 로 cheatsheet popup** (현재 pane 의 키 매핑 + global 토글 표) — `Esc` / `F1` 로 닫기. Search 패널은 200ms debounce 후 background worker 가 검색 — 키 입력으로 UI freeze 안 됨, 사용자가 계속 타이핑하면 stale 결과 자동 폐기 (generation counter). Ask 패널은 multi-turn — 같은 conversation 안에서 Q1/A1, Q2/A2 transcript 누적, 다음 질문이 이전 턴을 history 로 받아 답변. 답변 본문은 markdown 렌더 (bold/italic/inline code/heading/list/code fence/table/blockquote, raw `**bold**` 가 실제 굵게 표시). `Ctrl-L` 로 새 conversation 시작. Search 의 `g` 키가 `$EDITOR` (기본 `vi`) 로 hit 의 citation 위치 열기 — 종료 후 TUI 화면이 자동으로 깨끗이 redraw. CLI `kebab ask` 는 raw markdown 그대로 (terminal 호환성 위해). Library 의 doc-list 가 한글 / 일본어 / 중국어 (CJK) 제목을 wide-char 정확한 column width 로 truncate — 한글 제목이 한 줄을 넘기지 않음 (CJK 1 자 = 2 col). Search/Ask/Filter 입력의 cursor 가 wide char 위에서 column 단위로 정렬 — 한글 입력 시 caret 이 글자 옆에 정확히 놓임. `← / →` 로 입력 문자열 중간 cursor 이동 (한글 한 글자 = 2 column 이라도 한 번에 이동), `Home / End` 로 양 끝 점프, `Delete` 로 cursor 위치 char 삭제 — 모든 input pane (Ask / Search / Library filter overlay) 동일 (p9-fb-22). Ask 트랜스크립트는 새 답변이 viewport 아래로 누적될 때 자동으로 tail 을 따라감 (auto-scroll); `j` / `k` 로 위로 스크롤하면 freeze, `Shift-G` 로 다시 bottom + auto-tail 재개. 화면 하단 hint line 은 한국어 동사구로 (`"위로"` / `"아래로"` / `"필터"` / `"타이핑 검색어"` / `"Esc 로 NORMAL 모드"` / `"i 입력모드"` 등) + 현재 (pane, mode) 조합에 맞춰 자동 분기, **첫 fragment 가 항상 `F1 도움말`** (cheatsheet 발견성 보장). 모든 모드에서 항상 떠 있는 상태바 — `kebab v<version> │ <pane> │ <docs> docs │ <state>` (state: streaming/searching/indexing/idle, ingest 진행 중에는 progress 가 같은 자리에 흡수됨). Ask 진입 시 conversation id 8 자 prefix 도 함께 표시. Ask 트랜스크립트와 Inspect 양쪽에서 `PgUp / PgDn` 으로 10 줄씩 페이지 스크롤. Library 의 doc list 위에는 `TITLE / TAGS / UPDATED / CHUNKS` 컬럼 헤더 행 표시 (display-width 정렬, Hangul / CJK 안전). |
| `kebab reset [--all / --data-only / --vector-only / --config-only] [--yes]` | XDG 데이터 wipe. **Irreversible.** TTY 면 confirm prompt, 아니면 `--yes` 필수. `--vector-only` 는 SQLite `embedding_records` 도 함께 truncate (orphan 방지) |
| `kebab eval run / compare` | golden query 회귀 측정 |
| `kebab schema [--json]` | introspection — wire schemas / capabilities / models / stats 한 번에. `--json``schema.v1` wire; 사람 모드는 서식 출력. **stats 에 (p9-fb-37) `media_breakdown` (5 keys: markdown / pdf / image / audio / other) + `lang_breakdown` (BCP-47 코드, NULL 은 literal `"null"`) + `index_bytes` (sqlite + lancedb on-disk 합계) + `stale_doc_count` (`config.search.stale_threshold_days` 초과 doc 수) 추가.** |
| `kebab ingest-file <path>` | 단일 파일 ingest (workspace 외부 가능). 바이트는 `<workspace.root>/_external/<hash12>.<ext>` 로 copy. `.kebabignore` 매치 시 stderr warn 후 진행 (explicit ingest 가 bypass intent). |
| `kebab ingest-stdin --title <T> [--source-uri <URI>]` | stdin 의 markdown 본문 ingest. frontmatter (title + source_uri) 자동 prepend. v1 markdown only. |
| `kebab mcp` | MCP (Model Context Protocol) stdio server. agent host (Claude Code / Cursor / OpenAI Agents) 가 spawn 하여 tool 호출 (`search` / `bulk_search` / `ask` / `fetch` / `schema` / `doctor` / `ingest_file` / `ingest_stdin`). `--config` honor. |
모든 명령에 `--json` 플래그. 출력은 frozen wire schema v1 (`schema_version` 항상 포함, 예: `ingest_report.v1`, `ingest_progress.v1`, `search_hit.v1`, `answer.v1`, `doctor.v1`, `reset_report.v1`).
모든 명령에 `--json` 플래그. 출력은 frozen wire schema v1 (`schema_version` 항상 포함, 예: `ingest_report.v1`, `ingest_progress.v1`, `search_hit.v1`, `answer.v1`, `doctor.v1`, `reset_report.v1`, `schema.v1`). `--json` 모드에서 fatal error 는 stderr 에 `error.v1` ndjson 으로 emit (exit code 0/1/2/3 unchanged).
글로벌 플래그: `--readonly` (또는 `KEBAB_READONLY=1`) — 모든 write-path 명령 (`ingest` / `ingest-file` / `ingest-stdin` / `reset`) 을 비활성화, exit 1. `--quiet` — 진행 바 / hint 등 human-readable stderr 억제 (exit code / stdout 출력은 그대로). `KEBAB_PROGRESS=plain` — TTY 가 없는 환경에서도 진행 상황을 plain-text 한 줄씩 stderr 로 출력 (spinner 대신).
### Score 해석 (fb-38)
`search_hit.v1.score`**ranking signal** 이지 confidence 가 아니다. `score_kind` 필드로 의미 선언:
| `score_kind` | 의미 | 범위 |
|--------------|------|------|
| `rrf` (hybrid) | RRF normalized | `[0, 1]`, ceiling = 1.0 (양 채널 rank=1) |
| `bm25` (lexical) | raw BM25 | unbounded (≥ 0) |
| `cosine` (vector) | cosine sim | `[-1, 1]` |
#### RRF 수식 (hybrid mode)
```
chunk c 의 raw RRF = Σ_m 1 / (k_rrf + rank_m(c))
여기서 m ∈ {lexical, vector}, k_rrf = config.search.rrf_k (default 60).
양 채널 모두 rank=1 일 때 raw RRF = 2 / (k_rrf + 1) ≈ 0.0328.
normalize: rrf_score = raw_rrf / (2 / (k_rrf + 1))
→ rrf_score ∈ [0, 1]. 양쪽 rank=1 → 1.0, 한 쪽만 등장 → ≈ 0.5 천장.
```
`rrf_score = 0.5` 의 의미: chunk 가 한 채널 (lexical 또는 vector) 에서만 rank 1 로 등장. confidence 50% 가 아님 — RRF 수식의 산술적 천장.
agent 가 trust threshold 가 필요하면 top-level `score` 가 아닌 nested `retrieval.lexical_score` (BM25 raw) / `retrieval.vector_score` (cosine raw) 사용.
## 논리 아키텍처
@@ -98,9 +131,9 @@ flowchart TB
end
subgraph Pipeline["도메인 + 파이프라인"]
parse["parse-md / parse-pdf / parse-image"]
chunker["chunker (md-heading-v1, pdf-page-v1)"]
embedder["embedder (fastembed multilingual-e5-small)"]
parse["parse-md / parse-pdf / parse-image / parse-code"]
chunker["chunker (md-heading-v1, pdf-page-v1, code-rust-ast-v1, code-python-ast-v1, code-ts-ast-v1, code-js-ast-v1)"]
embedder["embedder (fastembed multilingual-e5-large)"]
retriever["retriever (lexical / vector / hybrid RRF)"]
rag["RAG pipeline"]
end
@@ -145,7 +178,18 @@ flowchart TB
## Configuration
- `~/.config/kebab/config.toml``kebab init` 가 XDG 경로에 생성. `[workspace] include`, `[storage]`, `[chunking]`, `[models.embedding]`, `[models.llm]`, `[image.ocr]`, `[image.caption]`, `[search]`, `[rag]`, `[ui]` 절. `[ui] theme = "dark" | "light"` 로 TUI 팔레트 선택 (default `"dark"`, 알 수 없는 값은 dark fallback).
- `~/.config/kebab/config.toml``kebab init` 가 XDG 경로에 생성. `[workspace]` (root, exclude — include 필드는 제거됨, 지원 형식은 자동 결정), `[storage]`, `[chunking]`, `[models.embedding]`, `[models.llm]`, `[image.ocr]`, `[image.caption]`, `[search]`, `[rag]`, `[ui]` 절.
- `[models.embedding]`
- `model` (default `"multilingual-e5-large"`, fb-39b) — 다국어 sentence embedding 모델. 1024-dim. ONNX (~1.3 GB) 첫 실행 시 fastembed cache (`config.storage.model_dir/fastembed/`) 에 자동 다운로드. `"multilingual-e5-small"` (384 dim) 는 backwards-compat 으로 사용 가능 — TOML 에 명시.
- `dimensions` (default `1024`) — 모델의 embedding 차원. config 와 LanceDB stored dim 불일치 시 검색 결과 0 건 (orphan table). 모델 변경 시 `kebab reset --vector-only && kebab ingest` 로 vector index 재구축 권장.
- `[ui] theme = "dark" | "light"` 로 TUI 팔레트 선택 (default `"dark"`, 알 수 없는 값은 dark fallback).
- `[search] stale_threshold_days = 30` (p9-fb-32) — search hit / RAG citation 의 `stale` 플래그 기준 (default 30 일, `0` 으로 비활성화). 옛 config 의 `workspace.include = [...]` 은 silently 무시 + 단발 deprecation warning (p9-fb-25).
- `[ingest.code]` (p10-1A-1) — code ingest 의 skip 정책 + chunker 기본값.
- `skip_generated_header = true` — 첫 ~512 byte 의 generated marker (`@generated` / `DO NOT EDIT` 등) 감지 시 skip.
- `max_file_bytes = 262144` (256 KiB) / `max_file_lines = 5000` — 파일당 cap, 초과 시 skip.
- `extra_skip_globs = []` — 사용자 추가 skip 패턴 (`.gitignore` 문법).
- `.gitignore` honor: 자동 적용. `.kebabignore` 는 추가 layer. 우선순위: built-in safety net (`node_modules/` / `target/` / `__pycache__/` / `.venv/` / `venv/` / `env/`) > `.gitignore` > `.kebabignore`.
- `[rag] prompt_template_version` (default `"rag-v2"`) — RAG system prompt version. `"rag-v1"` 은 legacy backwards-compat (사용자 명시 시 유지). v2 강화 규칙: (1) fact 인용 시 [#번호] 앞에 chunk 속 원문 큰따옴표 표기, (2) 학습 지식 동원 금지, (3) 근거 모호 시 "확실하지 않다" 명시.
- `--config <path>` flag — 임시 워크스페이스 / 격리 테스트 시 사용. CLI / TUI 모두 honor.
- `KEBAB_*` env — 일부 키 override (`KEBAB_RAG_SCORE_GATE`, `KEBAB_EVAL_GOLDEN`, `KEBAB_COMMIT_HASH` 등).
- XDG layout: `~/.config/kebab/`, `~/.local/share/kebab/`, `~/.cache/kebab/`, `~/.local/state/kebab/`.
@@ -157,10 +201,30 @@ config 예시는 [docs/SMOKE.md](docs/SMOKE.md) 의 `/tmp/kebab-smoke/config.tom
`--json` 출력 + frozen wire schema v1 가 stable contract. 통합 옵션:
- **Claude Code / Codex skill** — `kebab search --json` / `kebab ask --json` 호출하는 ~50줄 wrapper. multi-turn 은 `kebab ask --session <id> --json` 으로 영속 — wrapper 가 conversation id 관리하면 외부 agent 도 `--repl` 없이 stateful 대화 가능 (p9-fb-18).
- **MCP server** — stdio JSON-RPC 로 `kebab-app` facade 1:1 노출.
- **Claude Code skill** — repo 의 [`integrations/claude-code/`](integrations/claude-code/) 가 ship-ready skill. `cp -r integrations/claude-code/kebab ~/.claude/skills/` 한 번이면 새 Claude Code 세션부터 자동 trigger (내부 시스템 / 위키 lookup / 사내 runbook 질문). multi-turn 은 `kebab ask --session <id> --json` 으로 영속 — skill 이 conversation id 관리하면 외부 agent 도 `--repl` 없이 stateful 대화 가능 (p9-fb-18).
- **Codex / 기타 agent host** — `--json` + frozen wire schema v1 가 stable contract. 동일 패턴으로 ~50줄 wrapper 작성 가능. `integrations/<host>/` 에 추가 PR 환영.
- **MCP server** — stdio JSON-RPC 로 `kebab-app` facade 1:1 노출. `kebab mcp` 참조.
- **HTTP wrapper** — `kebab serve --bind 127.0.0.1:7711` (P+, local-only 가치 신중).
## MCP 사용
`kebab mcp` 가 stdio MCP server. 8 tool: `search` / `bulk_search` (p9-fb-42 — N query 한 번에) / `ask` / `fetch` (p9-fb-35) / `schema` / `doctor` / `ingest_file` / `ingest_stdin`.
Claude Code 빠른 등록 (`~/.claude/mcp.json` 또는 host 동등 위치):
```json
{
"mcpServers": {
"kebab": {
"command": "kebab",
"args": ["mcp"]
}
}
}
```
자세한 사용법 (Cursor / OpenAI Agents / Copilot CLI config, per-tool 입출력 예시, troubleshooting, multi-turn ask + session 관리, performance / security) — **[docs/mcp-usage.md](docs/mcp-usage.md)** 참조.
## 비-목표
다중 사용자 SaaS / K8s / 원격 vector DB / enterprise RBAC / 실시간 협업 / 모든 파일 포맷의 완벽한 parsing / agent 임의 파일 수정 / multi-workspace / LLM-as-judge eval / CLIP 시각 embedding / `kebab://` protocol handler — frozen 설계 §11 / §0 참조.

View File

@@ -32,6 +32,10 @@ kebab-parse-image = { path = "../kebab-parse-image" }
# per-asset dispatch (see `ingest_one_asset` PDF branch) and runs the
# resulting `CanonicalDocument` through `kebab-chunk::PdfPageV1Chunker`.
kebab-parse-pdf = { path = "../kebab-parse-pdf" }
# p10-1A-2: Rust AST extractor lives here. App threads it into the
# per-asset dispatch (see `ingest_one_asset` Code branch) and runs the
# resulting `CanonicalDocument` through `kebab-chunk::CodeRustAstV1Chunker`.
kebab-parse-code = { path = "../kebab-parse-code" }
anyhow = { workspace = true }
blake3 = { workspace = true }
serde = { workspace = true }
@@ -49,6 +53,11 @@ lru = { workspace = true }
# `" foo "` collapse to one entry. Same crate kebab-normalize +
# kebab-core already use, no version drift.
unicode-normalization = "0.1"
# p9-fb-31: GitignoreBuilder for .kebabignore matching in ingest_file_with_config.
# Same version as kebab-source-fs (0.4) to avoid duplicate dep versions.
ignore = "0.4"
# p9-fb-34: opaque pagination cursor encodes payload as base64.
base64 = { workspace = true }
[dev-dependencies]
rusqlite = { workspace = true }
@@ -64,3 +73,6 @@ image = { version = "0.25", default-features = false, features =
# to the same major (0.32) so byte output is identical between the two
# fixture surfaces.
lopdf = "0.32"
# error_wire::tests::llm_unreachable_classifies_to_model_unreachable needs a real
# reqwest::Error (private constructor) — built from a connect-refused call.
reqwest = { version = "0.12", default-features = false, features = ["blocking", "rustls-tls"] }

View File

@@ -40,8 +40,8 @@ use anyhow::{Context, Result, anyhow};
use lru::LruCache;
use kebab_core::{
Answer, Embedder, IndexVersion, LanguageModel, Retriever, SearchHit, SearchMode,
SearchQuery, VectorStore,
Answer, DocumentStore, Embedder, IndexVersion, LanguageModel, Retriever, SearchHit,
SearchMode, SearchOpts, SearchQuery, VectorStore,
};
use kebab_embed_local::FastembedEmbedder;
use kebab_llm_local::OllamaLanguageModel;
@@ -50,6 +50,31 @@ use kebab_search::{HybridRetriever, LexicalRetriever, VectorRetriever};
use kebab_store_sqlite::SqliteStore;
use kebab_store_vector::LanceVectorStore;
/// p9-fb-34: top-level wrapper around a paginated, budget-limited
/// search result. Mirrors the wire `search_response.v1` shape.
///
/// `next_cursor` is non-null whenever more hits may be reachable —
/// either the retriever filled the page (more behind it), or the
/// budget loop popped hits (those popped hits remain fetchable
/// from `offset + returned`). It is null only when the retriever
/// returned fewer hits than requested AND nothing was popped — i.e.
/// the corpus has nothing more for this query.
///
/// `truncated` is independent of `next_cursor`: it signals that
/// the budget loop modified the page (snippet shorten or k pop).
/// Caller may either widen `max_tokens` (and re-issue the same
/// query) or follow `next_cursor` (to advance through more hits)
/// or both.
#[derive(Clone, Debug)]
pub struct SearchResponse {
pub hits: Vec<SearchHit>,
pub next_cursor: Option<String>,
pub truncated: bool,
/// p9-fb-37: present when caller passed `SearchOpts.trace = true`.
/// Consumers that ignore trace should leave this `None`.
pub trace: Option<kebab_core::SearchTrace>,
}
/// Facade state — see module docs for lifetime rules.
///
/// The struct is public so long-lived callers (kb-eval, the future P9
@@ -190,7 +215,21 @@ impl App {
corpus_revision = key.corpus_revision,
"search served from LRU cache"
);
return Ok(hits.clone());
// p9-fb-32: re-stamp staleness on every cache hit. The cache
// entry was stamped at insert time against an older `now`
// and an older threshold; if either has shifted (config
// reload, time passing) the cached `stale: false` may now
// be wrong. Re-stamping is cheap (per-hit comparison) and
// avoids invalidating the cache on threshold changes.
let mut hits = hits.clone();
drop(guard);
let now = time::OffsetDateTime::now_utc();
crate::staleness::mark_stale_in_place(
&mut hits,
now,
self.config.search.stale_threshold_days,
);
return Ok(hits);
}
// Drop the lock before the (potentially slow) retriever call
// so other in-flight searches can use the cache concurrently.
@@ -205,14 +244,14 @@ impl App {
/// Used by `--no-cache` CLI invocations and by `search` itself
/// on cache miss. Identical behavior to the pre-fb-19 `search`.
pub fn search_uncached(&self, query: SearchQuery) -> Result<Vec<SearchHit>> {
match query.mode {
let mut hits = match query.mode {
SearchMode::Lexical => {
let lex = LexicalRetriever::with_settings(
self.sqlite.clone(),
lexical_index_version(&self.config),
self.config.search.snippet_chars,
);
lex.search(&query)
lex.search(&query)?
}
SearchMode::Vector => {
let (emb, vec_store) = self.require_embeddings()?;
@@ -226,7 +265,7 @@ impl App {
vec_iv,
self.config.search.snippet_chars,
);
retr.search(&query)
retr.search(&query)?
}
SearchMode::Hybrid => {
let lex = Arc::new(LexicalRetriever::with_settings(
@@ -246,9 +285,232 @@ impl App {
self.config.search.snippet_chars,
)) as Arc<dyn Retriever>;
let hybrid = HybridRetriever::new(&self.config, lex, vec_retr);
hybrid.search(&query)
hybrid.search(&query)?
}
};
// p9-fb-32: stamp staleness against the freshest possible `now`
// and the current threshold. Cheap (per-hit comparison).
let now = time::OffsetDateTime::now_utc();
crate::staleness::mark_stale_in_place(
&mut hits,
now,
self.config.search.stale_threshold_days,
);
// p10-1A-2: backfill `code_lang` from the Citation::Code `lang`
// field. The search layer (kebab-search) constructs SearchHit with
// `code_lang: None`; we own the post-processing here in kebab-app
// and can fill it cheaply from data already present in the hit.
backfill_code_lang(&mut hits);
// p10-1A-2 Task 8b: backfill `repo` from the document's
// `Metadata.repo`. Unlike `code_lang`, this cannot be derived from
// the Citation alone — it requires a store lookup by `doc_id`.
self.backfill_repo(&mut hits);
Ok(hits)
}
/// p9-fb-34: budget-aware search facade. Returns hits trimmed to
/// `opts.max_tokens` (chars/4 approximation) plus pagination
/// metadata. `App::search` is now a thin wrapper that drops the
/// metadata for backwards compat.
///
/// `SearchResponse.next_cursor` and `truncated` are independent
/// signals — see `SearchResponse` doc for details.
pub fn search_with_opts(
&self,
query: SearchQuery,
opts: SearchOpts,
) -> Result<SearchResponse> {
use crate::cursor;
let corpus_revision = self.sqlite.corpus_revision().to_string();
let offset = match opts.cursor.as_ref() {
// p9-fb-34: wrap the typed ErrorV1 in StructuredError so
// anyhow carries the structured payload all the way to
// `classify` — string formatting here would degrade
// `code = "stale_cursor"` to `code = "generic"` on the wire.
Some(c) => cursor::decode(c, &corpus_revision)
.map_err(|e| anyhow::Error::new(crate::error_wire::StructuredError(e)))?,
None => 0,
};
let snippet_chars = opts
.snippet_chars
.unwrap_or(self.config.search.snippet_chars);
// Fetch enough to satisfy offset + the requested page. The
// retriever returns at most `fetch_k` hits — we then drop
// `offset` and keep the next `k_effective`. `k = 0` is
// treated as "use config default" so a caller passing through
// a default-constructed `SearchQuery` still gets useful work
// out of the budget facade.
let k_effective = if query.k == 0 {
self.config.search.default_k
} else {
query.k
};
let fetch_k = offset.saturating_add(k_effective);
let fetch_query = SearchQuery {
k: fetch_k,
..query.clone()
};
// p9-fb-37: when --trace is requested, bypass the LRU cache and
// run through `HybridRetriever::search_with_trace`, which
// dispatches by mode internally. Vector / hybrid modes require
// embeddings (same as `--mode hybrid`); lexical mode skips
// embedder construction via `NoopRetriever` so lexical-only
// workspaces (provider = "none") can use `--trace` without
// surfacing the "switch to --mode lexical" error.
if opts.trace {
let lex = Arc::new(LexicalRetriever::with_settings(
self.sqlite.clone(),
lexical_index_version(&self.config),
self.config.search.snippet_chars,
)) as Arc<dyn Retriever>;
let vec_retr: Arc<dyn Retriever> = if matches!(query.mode, SearchMode::Lexical) {
// `HybridRetriever::search_with_trace` never invokes the
// vector retriever for `SearchMode::Lexical` (Task 4).
// A no-op stand-in lets us avoid the ~470 MB embedder
// load when the user only asked for lexical trace.
Arc::new(NoopRetriever)
} else {
let (emb, vec_store) = self.require_embeddings()?;
let vec_iv = vector_index_version(emb.as_ref());
let vec_dyn: Arc<dyn VectorStore + Send + Sync> = vec_store;
let emb_dyn: Arc<dyn Embedder> = emb;
Arc::new(VectorRetriever::with_settings(
vec_dyn,
emb_dyn,
self.sqlite.clone(),
vec_iv,
self.config.search.snippet_chars,
)) as Arc<dyn Retriever>
};
let hybrid = HybridRetriever::new(&self.config, lex, vec_retr);
let (mut traced_hits, trace) = hybrid.search_with_trace(&fetch_query)?;
// Stamp staleness — same as search_uncached.
let now = time::OffsetDateTime::now_utc();
crate::staleness::mark_stale_in_place(
&mut traced_hits,
now,
self.config.search.stale_threshold_days,
);
// p10-1A-2: backfill code_lang — same as search_uncached.
backfill_code_lang(&mut traced_hits);
// p10-1A-2 Task 8b: backfill repo — same as search_uncached.
self.backfill_repo(&mut traced_hits);
// Apply offset + k_effective truncation (mirrors non-trace path).
let drop_n = offset.min(traced_hits.len());
traced_hits.drain(..drop_n);
let mut hits: Vec<SearchHit> =
traced_hits.into_iter().take(k_effective).collect();
// Snippet truncation if opts.snippet_chars set (mirror non-trace path).
if opts.snippet_chars.is_some() {
for h in hits.iter_mut() {
if h.snippet.chars().count() > snippet_chars {
h.snippet = trim_to_chars(&h.snippet, snippet_chars);
}
}
}
// Trace path skips the budget loop. Caller will inspect
// `hits.len()` and `trace.timing` rather than paginate.
return Ok(SearchResponse {
hits,
next_cursor: None,
truncated: false,
trace: Some(trace),
});
}
// backfill_code_lang + backfill_repo are applied inside `search`
// via `search_uncached` — no explicit call needed here. Trace
// branch above calls them directly because it bypasses `search`.
let mut all_hits = self.search(fetch_query)?;
// Skip offset.
let drop_n = offset.min(all_hits.len());
all_hits.drain(..drop_n);
let mut hits: Vec<SearchHit> =
all_hits.into_iter().take(k_effective).collect();
// Apply snippet_chars override if shorter than what the
// retriever returned (retriever already honored
// `config.search.snippet_chars`; this only kicks in when the
// caller asked for *less*).
if opts.snippet_chars.is_some() {
for h in hits.iter_mut() {
if h.snippet.chars().count() > snippet_chars {
h.snippet = trim_to_chars(&h.snippet, snippet_chars);
}
}
}
// Budget loop.
let mut truncated = false;
if let Some(max_tokens) = opts.max_tokens {
let max_chars = max_tokens.saturating_mul(4);
// Step 1: shorten snippets progressively to a 60-char floor.
const SNIPPET_FLOOR: usize = 60;
let mut current_snippet_cap = snippet_chars;
while estimate_chars(&hits) > max_chars
&& current_snippet_cap > SNIPPET_FLOOR
{
current_snippet_cap =
(current_snippet_cap / 2).max(SNIPPET_FLOOR);
for h in hits.iter_mut() {
if h.snippet.chars().count() > current_snippet_cap {
h.snippet =
trim_to_chars(&h.snippet, current_snippet_cap);
truncated = true;
}
}
}
// Step 2: pop hits from the end until we fit, but always
// keep ≥ 1.
while estimate_chars(&hits) > max_chars && hits.len() > 1 {
hits.pop();
truncated = true;
}
}
// p9-fb-34: emit cursor whenever more hits may be reachable.
// Three cases produce a non-null cursor:
// (a) returned == k_effective: retriever filled the page; there
// may be more behind it. Speculative — next call may return
// an empty page if nothing remains.
// (b) truncated by k-pop: returned < k_effective because we
// popped hits to fit the budget. Those popped hits live at
// offset+returned..; next call (with same or wider budget)
// resumes from there.
// (c) truncated by snippet-only shrink: returned == k_effective,
// falls under (a). Cursor lets caller paginate; widening
// --max-tokens lets caller re-fetch fuller snippets at the
// same offset.
//
// No cursor when neither (a) nor (b) applies — i.e. the retriever
// returned fewer than k_effective AND we didn't pop. That means
// end of available results.
let returned = hits.len();
let next_cursor = if returned == k_effective || truncated {
if offset.saturating_add(returned) > 0 {
Some(cursor::encode(offset + returned, &corpus_revision))
} else {
None
}
} else {
None
};
Ok(SearchResponse {
hits,
next_cursor,
truncated,
trace: None,
})
}
/// Run a RAG `ask` against the configured retriever + LLM. Reuses
@@ -531,6 +793,58 @@ impl App {
}
}
/// p10-1A-2 Task 8b: back-fill `SearchHit.repo` from the originating
/// document's `Metadata.repo` for every hit whose `repo` field is
/// currently `None`. The search layer (kebab-search) constructs hits
/// with `repo: None` because it has no store access; we fill it here
/// in kebab-app post-retrieval via a per-distinct-`doc_id` store lookup.
///
/// Deduplication: a small `HashMap` accumulates the
/// `(doc_id → Option<String>)` mapping so each unique document is
/// fetched at most once. Search result sets are small (default k ≤ 20),
/// so the map overhead is negligible. A `None` entry is cached too
/// (document not found or no repo in metadata) to avoid re-querying.
///
/// Non-repo documents (markdown, PDF, plain text, code files outside a
/// git tree) correctly keep `repo: None` — `Metadata.repo` is already
/// `None` for those, so the assignment is a no-op.
fn backfill_repo(&self, hits: &mut [SearchHit]) {
use std::collections::HashMap;
use kebab_core::DocumentId;
// doc_id → Option<String> where None means "not found / no repo"
let mut cache: HashMap<DocumentId, Option<String>> = HashMap::new();
for hit in hits.iter_mut() {
if hit.repo.is_some() {
continue;
}
let repo_val = cache
.entry(hit.doc_id.clone())
.or_insert_with(|| {
// Deliberately non-aborting: a failed store lookup for
// one hit must not abort the whole search response. Log
// the error so it's observable rather than silently
// dropped (review #140 round 1).
match self.sqlite.get_document(&hit.doc_id) {
Ok(opt) => opt.and_then(|doc| doc.metadata.repo),
Err(e) => {
tracing::warn!(
target: "kebab-app",
doc_id = %hit.doc_id,
error = %e,
"backfill_repo: get_document failed; leaving hit.repo = None"
);
None
}
}
});
if let Some(r) = repo_val {
hit.repo = Some(r.clone());
}
}
}
/// Resolve the embedder + vector store, surfacing the user-friendly
/// "switch to --mode lexical" error when embeddings are disabled.
fn require_embeddings(
@@ -564,6 +878,24 @@ fn lexical_index_version(config: &kebab_config::Config) -> IndexVersion {
IndexVersion(format!("lex:{}", config.chunking.chunker_version))
}
/// p9-fb-37: stand-in for the vector retriever in the trace path when
/// `query.mode == SearchMode::Lexical`. `HybridRetriever::search_with_trace`'s
/// Lexical branch never calls `vector.search()`, so returning an empty
/// hit list here is safe and lets lexical-only workspaces (embedding
/// `provider = "none"`) use `--trace` without paying the ~470 MB
/// embedder load.
struct NoopRetriever;
impl Retriever for NoopRetriever {
fn search(&self, _q: &kebab_core::SearchQuery) -> anyhow::Result<Vec<kebab_core::SearchHit>> {
Ok(Vec::new())
}
fn index_version(&self) -> kebab_core::IndexVersion {
kebab_core::IndexVersion("noop:trace".into())
}
}
/// Compose a stable `IndexVersion` for the vector retriever. Tracks
/// `(embedding_model, embedding_version, dimensions)` so a model swap
/// flags drift via the existing index_version mismatch warning in
@@ -604,6 +936,49 @@ fn blake3_truncate(input: &str) -> u128 {
u128::from_be_bytes(buf)
}
/// p9-fb-34: trim `s` to at most `n` Unicode scalar chars. Cheap
/// alternative to a `.chars().take(n).collect::<String>()` pattern;
/// reserves capacity proportional to UTF-8 worst case (4 bytes / char)
/// so the inner push never re-allocates.
fn trim_to_chars(s: &str, n: usize) -> String {
if s.chars().count() <= n {
return s.to_string();
}
let mut out = String::with_capacity(n.saturating_mul(4));
for (i, c) in s.chars().enumerate() {
if i >= n {
break;
}
out.push(c);
}
out
}
/// p9-fb-34: estimate wire JSON char cost of the hit list. Returns 0
/// per-hit when serialization fails — a SearchHit serialization
/// failure is an invariant violation; we degrade gracefully (loop
/// terminates early) rather than panic in the budget loop.
fn estimate_chars(hits: &[SearchHit]) -> usize {
hits.iter()
.map(|h| serde_json::to_string(h).map(|s| s.len()).unwrap_or(0))
.sum()
}
/// p10-1A-2: back-fill `SearchHit.code_lang` from `Citation::Code.lang`
/// for every code hit in the list. The search layer (kebab-search)
/// constructs hits with `code_lang: None`; we fill it here in kebab-app
/// post-retrieval so callers see the correct language identifier without
/// requiring a second SQL query.
fn backfill_code_lang(hits: &mut [SearchHit]) {
for hit in hits.iter_mut() {
if let kebab_core::Citation::Code { lang, .. } = &hit.citation {
if hit.code_lang.is_none() {
hit.code_lang = lang.clone();
}
}
}
}
#[cfg(test)]
mod tests {
use super::*;
@@ -646,3 +1021,59 @@ mod tests {
assert_ne!(a, d, "different session_id → different hash");
}
}
#[cfg(test)]
mod tests_trace {
use super::*;
use kebab_core::{SearchMode, SearchOpts, SearchQuery};
fn open_app_with_temp_dir() -> (tempfile::TempDir, App) {
let dir = tempfile::tempdir().unwrap();
let mut cfg = kebab_config::Config::defaults();
cfg.storage.data_dir = dir.path().to_string_lossy().into_owned();
// Bring up migrations.
let store = kebab_store_sqlite::SqliteStore::open(&cfg).unwrap();
store.run_migrations().unwrap();
drop(store);
let app = App::open_with_config(cfg).unwrap();
(dir, app)
}
#[test]
fn search_response_trace_none_when_opts_trace_false() {
let (_dir, app) = open_app_with_temp_dir();
let q = SearchQuery {
text: "x".into(),
mode: SearchMode::Lexical,
k: 1,
filters: Default::default(),
};
let resp = app.search_with_opts(q, SearchOpts::default()).unwrap();
assert!(resp.trace.is_none());
}
#[test]
fn search_response_trace_some_when_opts_trace_true_lexical_mode() {
// Lexical mode doesn't require embeddings — the trace path
// builds HybridRetriever with a `NoopRetriever` stand-in for
// the vector side, since `HybridRetriever::search_with_trace`'s
// Lexical branch never invokes `vector.search()`. Default
// Config has embedding `provider = "none"`, and lexical-mode
// trace must succeed under that config (no embedder load).
let (_dir, app) = open_app_with_temp_dir();
let q = SearchQuery {
text: "x".into(),
mode: SearchMode::Lexical,
k: 1,
filters: Default::default(),
};
let opts = SearchOpts {
trace: true,
..Default::default()
};
let resp = app
.search_with_opts(q, opts)
.expect("lexical-mode trace must succeed without embeddings");
assert!(resp.trace.is_some(), "trace populated when opts.trace=true");
}
}

View File

@@ -0,0 +1,298 @@
//! p9-fb-42: bulk multi-query facade. Sequential for-loop reusing
//! one App instance so embedder cold-start + LRU cache amortize
//! across the N queries.
use anyhow::Context;
use kebab_core::{
BulkSearchItem, BulkSearchSummary, DocumentId, Lang, SearchFilters, SearchHit, SearchMode,
SearchOpts, SearchQuery, TrustLevel,
};
use serde_json::Value;
use crate::{App, SearchResponse};
/// Hard cap on items per bulk call. Documented in spec — agents that
/// hit this should batch-split.
pub const BULK_QUERIES_MAX: usize = 100;
/// p9-fb-42: bulk search facade. Returns `(items, summary)` always
/// — per-query failures embed `error.v1` JSON in the item rather
/// than aborting the bulk call. Returns `Err` only for input
/// validation failures (e.g. >100 queries).
#[doc(hidden)]
pub fn bulk_search_with_config(
config: kebab_config::Config,
raw_items: Vec<Value>,
) -> anyhow::Result<(Vec<BulkSearchItem>, BulkSearchSummary)> {
if raw_items.len() > BULK_QUERIES_MAX {
anyhow::bail!(
"queries: max {} items, got {}",
BULK_QUERIES_MAX,
raw_items.len()
);
}
let app = App::open_with_config(config).context("kebab-app: open for bulk_search")?;
let mut results: Vec<BulkSearchItem> = Vec::with_capacity(raw_items.len());
let mut succeeded: u32 = 0;
let mut failed: u32 = 0;
for raw in raw_items {
let item = run_one(&app, raw);
if item.error.is_some() {
failed += 1;
} else {
succeeded += 1;
}
results.push(item);
}
let summary = BulkSearchSummary {
total: succeeded + failed,
succeeded,
failed,
};
Ok((results, summary))
}
fn run_one(app: &App, raw: Value) -> BulkSearchItem {
let echo = raw.clone();
match parse_one(&raw) {
Ok((query, opts)) => match app.search_with_opts(query, opts) {
Ok(resp) => BulkSearchItem {
query: echo,
response: Some(serialize_search_response(&resp)),
error: None,
},
Err(e) => BulkSearchItem {
query: echo,
response: None,
error: Some(error_v1_json("retrieval_error", &format!("{e:#}"), None)),
},
},
Err(msg) => BulkSearchItem {
query: echo,
response: None,
error: Some(error_v1_json("invalid_input", &msg, None)),
},
}
}
/// Mirror of `kebab-cli::wire::wire_search_response` — `SearchResponse`
/// itself is not `Serialize`, so we build the `search_response.v1`-shaped
/// JSON manually. Each hit also gets `score` promoted from
/// `retrieval.fusion_score` per §2.2, matching the CLI wire layer.
fn serialize_search_response(r: &SearchResponse) -> Value {
let mut v = serde_json::json!({
"schema_version": "search_response.v1",
"hits": r.hits.iter().map(serialize_search_hit).collect::<Vec<_>>(),
"next_cursor": r.next_cursor,
"truncated": r.truncated,
});
if let Value::Object(ref mut map) = v {
let trace_v = match &r.trace {
Some(t) => serde_json::to_value(t).unwrap_or(Value::Null),
None => Value::Null,
};
map.insert("trace".to_string(), trace_v);
}
v
}
fn serialize_search_hit(h: &SearchHit) -> Value {
let mut v = serde_json::to_value(h).unwrap_or(Value::Null);
if let Value::Object(ref mut map) = v {
if let Some(Value::Object(retrieval)) = map.get("retrieval") {
if let Some(score) = retrieval.get("fusion_score").cloned() {
map.insert("score".to_string(), score);
}
}
map.insert(
"schema_version".to_string(),
Value::String("search_hit.v1".to_string()),
);
}
v
}
fn parse_one(raw: &Value) -> Result<(SearchQuery, SearchOpts), String> {
let obj = raw.as_object().ok_or("expected JSON object")?;
let text = obj
.get("query")
.and_then(|v| v.as_str())
.ok_or("missing required field: query")?
.to_string();
let mode = match obj.get("mode").and_then(|v| v.as_str()) {
None => SearchMode::Hybrid,
Some("hybrid") => SearchMode::Hybrid,
Some("lexical") => SearchMode::Lexical,
Some("vector") => SearchMode::Vector,
Some(other) => return Err(format!("invalid mode: {other:?}")),
};
let k = obj
.get("k")
.and_then(|v| v.as_u64())
.map(|n| n as usize)
.unwrap_or(0); // 0 → use config default in app
let trust_min = match obj.get("trust_min").and_then(|v| v.as_str()) {
None => None,
Some("primary") => Some(TrustLevel::Primary),
Some("secondary") => Some(TrustLevel::Secondary),
Some("generated") => Some(TrustLevel::Generated),
Some(other) => return Err(format!("invalid trust_min: {other:?}")),
};
let ingested_after = match obj.get("ingested_after").and_then(|v| v.as_str()) {
None => None,
Some(s) => Some(
time::OffsetDateTime::parse(s, &time::format_description::well_known::Rfc3339)
.map_err(|e| format!("invalid ingested_after RFC3339 {s:?}: {e}"))?,
),
};
let media: Vec<String> = obj
.get("media")
.and_then(|v| v.as_array())
.map(|arr| {
arr.iter()
.filter_map(|x| x.as_str().map(normalize_media_alias))
.collect()
})
.unwrap_or_default();
let tags_any: Vec<String> = obj
.get("tag")
.and_then(|v| v.as_array())
.map(|arr| {
arr.iter()
.filter_map(|x| x.as_str().map(String::from))
.collect()
})
.unwrap_or_default();
let lang = obj
.get("lang")
.and_then(|v| v.as_str())
.map(|s| Lang(s.to_string()));
let path_glob = obj
.get("path_glob")
.and_then(|v| v.as_str())
.map(String::from);
let doc_id = obj
.get("doc_id")
.and_then(|v| v.as_str())
.map(|s| DocumentId(s.to_string()));
let filters = SearchFilters {
tags_any,
lang,
path_glob,
trust_min,
media,
ingested_after,
doc_id,
repo: vec![],
code_lang: vec![],
};
let opts = SearchOpts {
max_tokens: obj
.get("max_tokens")
.and_then(|v| v.as_u64())
.map(|n| n as usize),
snippet_chars: obj
.get("snippet_chars")
.and_then(|v| v.as_u64())
.map(|n| n as usize),
cursor: obj.get("cursor").and_then(|v| v.as_str()).map(String::from),
trace: obj.get("trace").and_then(|v| v.as_bool()).unwrap_or(false),
};
Ok((
SearchQuery {
text,
mode,
k,
filters,
},
opts,
))
}
fn normalize_media_alias(s: &str) -> String {
match s.to_ascii_lowercase().as_str() {
"md" => "markdown".to_string(),
other => other.to_string(),
}
}
fn error_v1_json(code: &str, message: &str, hint: Option<&str>) -> Value {
serde_json::json!({
"schema_version": "error.v1",
"code": code,
"message": message,
"hint": hint,
})
}
#[cfg(test)]
mod tests {
use super::*;
fn open_temp() -> kebab_config::Config {
let dir = tempfile::tempdir().unwrap();
let mut cfg = kebab_config::Config::defaults();
cfg.storage.data_dir = dir.path().to_string_lossy().into_owned();
// Bring up migrations so SqliteStore::open_existing succeeds inside App::open.
let store = kebab_store_sqlite::SqliteStore::open(&cfg).unwrap();
store.run_migrations().unwrap();
drop(store);
// Leak the tempdir into a static — tests are short-lived; not worth threading.
std::mem::forget(dir);
cfg
}
#[test]
fn empty_input_returns_empty_summary() {
let cfg = open_temp();
let (items, summary) = bulk_search_with_config(cfg, vec![]).unwrap();
assert!(items.is_empty());
assert_eq!(summary.total, 0);
assert_eq!(summary.succeeded, 0);
assert_eq!(summary.failed, 0);
}
#[test]
fn over_cap_returns_err() {
let cfg = open_temp();
let raw: Vec<Value> = (0..101)
.map(|_| serde_json::json!({"query": "x"}))
.collect();
let err = bulk_search_with_config(cfg, raw).unwrap_err();
let msg = format!("{err:#}");
assert!(msg.contains("max 100"));
}
#[test]
fn invalid_item_emits_error_keeps_total_count() {
let cfg = open_temp();
let raw = vec![
serde_json::json!({"query": "ok", "mode": "lexical"}),
serde_json::json!({"mode": "lexical"}), // missing required `query`
];
let (items, summary) = bulk_search_with_config(cfg, raw).unwrap();
assert_eq!(items.len(), 2);
assert_eq!(summary.total, 2);
// First item: lexical mode against empty corpus succeeds with empty hits.
assert!(items[0].error.is_none());
// Second item: missing required field.
assert!(items[1].error.is_some());
assert_eq!(items[1].error.as_ref().unwrap()["code"], "invalid_input");
}
}

View File

@@ -0,0 +1,75 @@
//! p9-fb-34 opaque pagination cursor.
//!
//! Format: base64(JSON({offset: usize, corpus_revision: string})).
//! Opaque to callers — they MUST NOT decode the contents themselves;
//! the schema is internal and may change without notice.
use base64::Engine;
use base64::engine::general_purpose::URL_SAFE_NO_PAD;
use serde::{Deserialize, Serialize};
use serde_json::Value;
use crate::error_wire::ErrorV1;
#[derive(Serialize, Deserialize)]
struct Payload {
offset: usize,
corpus_revision: String,
}
/// Encode `(offset, corpus_revision)` as an opaque base64 string.
pub fn encode(offset: usize, corpus_revision: &str) -> String {
let payload = Payload {
offset,
corpus_revision: corpus_revision.to_string(),
};
let json = serde_json::to_vec(&payload).expect("Payload serializes");
URL_SAFE_NO_PAD.encode(&json)
}
/// Decode an opaque cursor against the expected `corpus_revision`.
/// Mismatch or malformed input returns an `ErrorV1` with
/// `code = "stale_cursor"`.
//
// p9-fb-34: ErrorV1 is the workspace-wide wire error struct (~200B
// after monomorphization with Value + String fields). Boxing here
// would force every call site to deref through a Box for no win —
// the err-path is rare. Single allow at the function level.
//
// p9-fb-34 round-1 review: differentiate the three failure modes
// (base64 / JSON / revision mismatch) with distinct messages — all
// keep `code = "stale_cursor"` so the agent's branching logic stays
// the same, but humans reading the message get a precise hint.
#[allow(clippy::result_large_err)]
pub fn decode(s: &str, expected_revision: &str) -> Result<usize, ErrorV1> {
let bytes = URL_SAFE_NO_PAD.decode(s.as_bytes()).map_err(|_| ErrorV1 {
schema_version: "error.v1".to_string(),
code: "stale_cursor".to_string(),
message: "cursor is not valid base64. Re-issue search to obtain a fresh cursor."
.to_string(),
details: Value::Null,
hint: None,
})?;
let payload: Payload = serde_json::from_slice(&bytes).map_err(|_| ErrorV1 {
schema_version: "error.v1".to_string(),
code: "stale_cursor".to_string(),
message: "cursor payload is malformed. Re-issue search to obtain a fresh cursor."
.to_string(),
details: Value::Null,
hint: None,
})?;
if payload.corpus_revision != expected_revision {
return Err(ErrorV1 {
schema_version: "error.v1".to_string(),
code: "stale_cursor".to_string(),
message: format!(
"cursor was issued against corpus_revision '{}'; current revision is \
'{}'. Re-issue search to obtain a fresh cursor.",
payload.corpus_revision, expected_revision
),
details: Value::Null,
hint: None,
});
}
Ok(payload.offset)
}

View File

@@ -0,0 +1,15 @@
//! Typed signal re-exports + new signals introduced by fb-27.
//!
//! kebab-cli (and future kebab-tui / kebab-desktop) downcast on these to
//! build `error.v1` wire records. The existing signals
//! (`RefusalSignal`, `NoHitSignal`, `DoctorUnhealthy`) live in
//! `doctor_signal.rs` — leave those unchanged and re-export via this
//! module so callers have one place to import from.
//!
//! See `docs/superpowers/specs/2026-05-07-p9-fb-27-introspection-and-error-wire-design.md`.
pub use crate::doctor_signal::{DoctorUnhealthy, NoHitSignal, RefusalSignal};
pub use kebab_llm_local::LlmError;
pub use kebab_config::ConfigInvalid;
pub use kebab_store_sqlite::NotIndexed;

View File

@@ -0,0 +1,260 @@
//! Map `anyhow::Error` (returned by `kebab-app` facade calls) to the
//! `error.v1` wire shape. The classifier downcasts to known typed errors
//! re-exported via `kebab_app::error_signal` (LlmError, ConfigInvalid,
//! NotIndexed) and falls back to `code: "generic"` for everything else.
//!
//! Refusal / no-hit / doctor-unhealthy are NOT routed here — they remain
//! exit-code-only signals (see main.rs `exit_code()`).
use serde::{Deserialize, Serialize};
use serde_json::{Value, json};
use crate::error_signal::{ConfigInvalid, LlmError, NotIndexed};
// p9-fb-34: `stale_cursor` is constructed directly by `cursor::decode`
// and surfaced through `StructuredError` (an anyhow-friendly wrapper
// that carries the typed `ErrorV1` payload without lossy string
// formatting). `classify` short-circuits on it at the top of the
// function so the typed `code = "stale_cursor"` reaches the wire.
/// Wire schema id for [`ErrorV1`]. Single source of truth — kebab-cli
/// + kebab-mcp use this via `kebab_app::ERROR_V1_ID`.
pub const ERROR_V1_ID: &str = "error.v1";
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ErrorV1 {
pub schema_version: String,
pub code: String,
pub message: String,
pub details: Value,
pub hint: Option<String>,
}
/// p9-fb-34: typed wrapper around an [`ErrorV1`] so callers that
/// surface `anyhow::Error` can downcast back to the structured wire
/// payload instead of losing it to string formatting. Constructed by
/// the cursor code path (`cursor::decode` → `App::search_with_opts`)
/// and short-circuited inside [`classify`].
#[derive(Debug)]
pub struct StructuredError(pub ErrorV1);
impl std::fmt::Display for StructuredError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "[{}] {}", self.0.code, self.0.message)
}
}
impl std::error::Error for StructuredError {}
pub fn classify(err: &anyhow::Error, verbose: bool) -> ErrorV1 {
// p9-fb-34: structured wrapper short-circuits — preserves the
// typed payload that callers (cursor::decode) constructed
// instead of falling through to `code = "generic"`.
if let Some(s) = err.downcast_ref::<StructuredError>() {
return s.0.clone();
}
if let Some(s) = err.downcast_ref::<ConfigInvalid>() {
return ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "config_invalid".to_string(),
message: s.to_string(),
details: json!({
"path": s.path.to_string_lossy(),
"cause": s.cause,
}),
hint: Some("check `--config <path>` and TOML syntax".to_string()),
};
}
if let Some(s) = err.downcast_ref::<NotIndexed>() {
return ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "not_indexed".to_string(),
message: s.to_string(),
details: json!({
"expected": s.expected,
"found": s.found,
}),
hint: Some("run `kebab init` then `kebab ingest`".to_string()),
};
}
if let Some(s) = err.downcast_ref::<LlmError>() {
return classify_llm(s);
}
if let Some(io) = err.downcast_ref::<std::io::Error>() {
return ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "io_error".to_string(),
message: io.to_string(),
details: json!({"kind": format!("{:?}", io.kind())}),
hint: None,
};
}
let mut details = json!({});
if verbose {
let chain: Vec<String> = err.chain().map(|c| c.to_string()).collect();
details = json!({"chain": chain});
}
ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "generic".to_string(),
message: err.to_string(),
details,
hint: None,
}
}
fn classify_llm(s: &LlmError) -> ErrorV1 {
match s {
LlmError::Unreachable { endpoint, source } => ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "model_unreachable".to_string(),
message: format!("ollama unreachable at {endpoint}"),
details: json!({
"endpoint": endpoint,
"source": source.to_string(),
}),
hint: Some(format!("ensure `ollama serve` is reachable at {endpoint}")),
},
LlmError::ModelNotPulled(model) => ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "model_not_pulled".to_string(),
message: format!("ollama model `{model}` is not pulled"),
details: json!({"model": model}),
hint: Some(format!("run `ollama pull {model}`")),
},
LlmError::Timeout(e) => ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "timeout".to_string(),
message: format!("ollama timeout: {e}"),
details: json!({"source": e.to_string()}),
hint: Some("increase timeout or check Ollama load".to_string()),
},
LlmError::Stream(body) => ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "generic".to_string(),
message: format!("ollama HTTP error: {body}"),
details: json!({"body": body}),
hint: None,
},
LlmError::Malformed(line) => ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "generic".to_string(),
message: format!("malformed response line: {line}"),
details: json!({"line": line}),
hint: None,
},
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn config_invalid_classifies_to_config_invalid_code() {
let err = anyhow::Error::new(ConfigInvalid {
path: std::path::PathBuf::from("/tmp/x.toml"),
cause: "missing".to_string(),
});
let v1 = classify(&err, false);
assert_eq!(v1.code, "config_invalid");
assert_eq!(v1.details.get("path").and_then(|p| p.as_str()), Some("/tmp/x.toml"));
assert!(v1.hint.is_some());
}
#[test]
fn not_indexed_classifies_correctly() {
let err = anyhow::Error::new(NotIndexed {
expected: "/data/k.sqlite".to_string(),
found: None,
});
let v1 = classify(&err, false);
assert_eq!(v1.code, "not_indexed");
}
#[test]
fn llm_unreachable_classifies_to_model_unreachable() {
// We cannot construct a reqwest::Error from scratch (private constructor).
// Approach: send a real request to a guaranteed-unroutable endpoint
// (port 1 is reserved + connect-refused on all conformant TCP stacks).
// 500ms timeout chosen as headroom over 50ms baseline — heavily loaded
// CI may hit timeout race instead of connect-refused, but either way
// the resulting LlmError::Unreachable maps to "model_unreachable".
let client = reqwest::blocking::Client::builder()
.timeout(std::time::Duration::from_millis(500))
.build().unwrap();
let err = client.get("http://127.0.0.1:1").send().unwrap_err();
let llm = LlmError::Unreachable {
endpoint: "http://127.0.0.1:1".to_string(),
source: err,
};
let anyhow_err = anyhow::Error::new(llm);
let v1 = classify(&anyhow_err, false);
assert_eq!(v1.code, "model_unreachable");
}
#[test]
fn model_not_pulled_classifies_correctly() {
let llm = LlmError::ModelNotPulled("gemma4:e4b".to_string());
let v1 = classify(&anyhow::Error::new(llm), false);
assert_eq!(v1.code, "model_not_pulled");
assert_eq!(v1.details.get("model").and_then(|p| p.as_str()), Some("gemma4:e4b"));
}
#[test]
fn unknown_error_classifies_to_generic() {
let err = anyhow::anyhow!("something else");
let v1 = classify(&err, false);
assert_eq!(v1.code, "generic");
assert!(v1.hint.is_none());
}
#[test]
fn generic_with_verbose_includes_chain() {
let err = anyhow::anyhow!("root").context("middle").context("leaf");
let v1 = classify(&err, true);
assert_eq!(v1.code, "generic");
let chain = v1.details.get("chain").and_then(|c| c.as_array()).unwrap();
assert_eq!(chain.len(), 3);
}
#[test]
fn io_error_classifies_correctly() {
let io = std::io::Error::new(std::io::ErrorKind::NotFound, "no such file");
let err = anyhow::Error::new(io);
let v1 = classify(&err, false);
assert_eq!(v1.code, "io_error");
}
#[test]
fn stale_cursor_is_not_routed_through_classify() {
use anyhow::anyhow;
let err: anyhow::Error = anyhow!("stale_cursor: rev mismatch");
let v1 = classify(&err, false);
// p9-fb-34: stale_cursor is constructed directly by cursor::decode
// (single source of truth). classify must not pattern-match on
// anyhow string contents — that would create two sources of
// truth. The bare anyhow string falls through to "generic".
assert_ne!(v1.code, "stale_cursor", "classify must not produce stale_cursor from bare anyhow string");
}
#[test]
fn stale_cursor_propagates_through_structured_wrapper() {
// p9-fb-34: positive-side contract for the structured-wrapper
// path. cursor::decode constructs a typed ErrorV1, the call site
// wraps it in `StructuredError`, anyhow carries it, and classify
// short-circuits via downcast — preserving the typed code +
// message instead of falling through to "generic".
let original = ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "stale_cursor".to_string(),
message: "test stale cursor".to_string(),
details: Value::Null,
hint: None,
};
let err: anyhow::Error = anyhow::Error::new(StructuredError(original));
let v1 = classify(&err, false);
assert_eq!(v1.code, "stale_cursor");
assert_eq!(v1.message, "test stale cursor");
}
}

View File

@@ -0,0 +1,253 @@
//! Helpers for the `_external/` workspace subdirectory used by
//! `ingest_file_with_config` and `ingest_stdin_with_config` (p9-fb-31).
//!
//! - `ensure_external_dir`: create `<workspace.root>/_external/` if absent.
//! - `ensure_kebabignore_entry`: append `_external/` to `<workspace.root>/.kebabignore`
//! if missing — prevents subsequent `kebab ingest` workspace walks from
//! re-walking files that were imported via single-file ingest.
//! - `copy_to_external`: write bytes to `_external/<blake3-12>.<ext>`, idempotent.
//! - `inject_frontmatter`: prepend a YAML frontmatter block to a markdown body
//! string (used by `ingest_stdin_with_config`).
use std::fs;
use std::io::Write;
use std::path::{Path, PathBuf};
use anyhow::{Context, Result};
pub const EXTERNAL_DIR: &str = "_external";
const KEBABIGNORE_LINE: &str = "_external/";
/// Ensure `<workspace_root>/_external/` exists. Returns the directory path.
pub fn ensure_external_dir(workspace_root: &Path) -> Result<PathBuf> {
let dir = workspace_root.join(EXTERNAL_DIR);
fs::create_dir_all(&dir)
.with_context(|| format!("create _external dir at {}", dir.display()))?;
Ok(dir)
}
/// Append `_external/` line to `<workspace_root>/.kebabignore` if not already
/// present. Idempotent — checks for the exact line before appending.
pub fn ensure_kebabignore_entry(workspace_root: &Path) -> Result<()> {
let path = workspace_root.join(".kebabignore");
let existing = if path.exists() {
fs::read_to_string(&path)
.with_context(|| format!("read existing .kebabignore at {}", path.display()))?
} else {
String::new()
};
let already = existing
.lines()
.any(|line| line.trim() == KEBABIGNORE_LINE);
if already {
return Ok(());
}
let mut file = fs::OpenOptions::new()
.create(true)
.append(true)
.open(&path)
.with_context(|| format!("open .kebabignore for append at {}", path.display()))?;
if !existing.is_empty() && !existing.ends_with('\n') {
file.write_all(b"\n")?;
}
writeln!(file, "{}", KEBABIGNORE_LINE)?;
Ok(())
}
/// Copy bytes to `<external_dir>/<blake3-12>.<ext>`. Idempotent — if the
/// destination file already exists with the expected hash, the existing
/// file is reused (no second write). Returns the destination path.
pub fn copy_to_external(
external_dir: &Path,
bytes: &[u8],
ext: &str,
) -> Result<PathBuf> {
let hash = blake3::hash(bytes);
let hex = hash.to_hex();
let prefix = &hex.as_str()[..12];
let filename = format!("{prefix}.{ext}");
let dest = external_dir.join(&filename);
if !dest.exists() {
fs::write(&dest, bytes)
.with_context(|| format!("write external file at {}", dest.display()))?;
}
Ok(dest)
}
/// Prepend a YAML frontmatter block to a markdown body. Returns the wrapped
/// markdown string. Errors if `body` already starts with `---` (the user
/// should use `ingest_file_with_config` for files that already carry
/// frontmatter).
///
/// Internal `yaml_quote` always uses double-quoted YAML form with backslash
/// escapes for `"` / `\` / control chars — agent-supplied titles with
/// special characters are safe.
pub fn inject_frontmatter(
body: &str,
title: &str,
source_uri: Option<&str>,
) -> Result<String> {
let head = body.trim_start();
if head.starts_with("---\n") || head.starts_with("---\r\n") || head.starts_with("---\r") {
anyhow::bail!(
"stdin already has frontmatter; use `kebab ingest-file` for files with metadata"
);
}
let title_yaml = yaml_quote(title);
let mut header = String::new();
header.push_str("---\n");
header.push_str(&format!("title: {title_yaml}\n"));
if let Some(uri) = source_uri {
let uri_yaml = yaml_quote(uri);
header.push_str(&format!("source_uri: {uri_yaml}\n"));
}
header.push_str("---\n\n");
header.push_str(body);
Ok(header)
}
/// YAML-quote a string. Always uses double-quoted form with backslash-escape
/// for `"` and `\`. Defensive against agent-supplied titles that contain
/// quotes / control chars.
fn yaml_quote(s: &str) -> String {
let mut out = String::with_capacity(s.len() + 2);
out.push('"');
for c in s.chars() {
match c {
'"' => out.push_str("\\\""),
'\\' => out.push_str("\\\\"),
'\n' => out.push_str("\\n"),
'\r' => out.push_str("\\r"),
c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
c => out.push(c),
}
}
out.push('"');
out
}
#[cfg(test)]
mod tests {
use super::*;
use tempfile::tempdir;
#[test]
fn ensure_external_dir_creates_dir() {
let dir = tempdir().unwrap();
let result = ensure_external_dir(dir.path()).unwrap();
assert_eq!(result, dir.path().join("_external"));
assert!(result.is_dir());
}
#[test]
fn ensure_external_dir_is_idempotent() {
let dir = tempdir().unwrap();
let _ = ensure_external_dir(dir.path()).unwrap();
let result = ensure_external_dir(dir.path()).unwrap();
assert!(result.is_dir());
}
#[test]
fn ensure_kebabignore_entry_creates_file_with_line() {
let dir = tempdir().unwrap();
ensure_kebabignore_entry(dir.path()).unwrap();
let content = fs::read_to_string(dir.path().join(".kebabignore")).unwrap();
assert!(content.lines().any(|l| l.trim() == "_external/"));
}
#[test]
fn ensure_kebabignore_entry_appends_to_existing() {
let dir = tempdir().unwrap();
fs::write(dir.path().join(".kebabignore"), "*.tmp\n").unwrap();
ensure_kebabignore_entry(dir.path()).unwrap();
let content = fs::read_to_string(dir.path().join(".kebabignore")).unwrap();
let lines: Vec<&str> = content.lines().collect();
assert!(lines.contains(&"*.tmp"));
assert!(lines.contains(&"_external/"));
}
#[test]
fn ensure_kebabignore_entry_idempotent() {
let dir = tempdir().unwrap();
ensure_kebabignore_entry(dir.path()).unwrap();
ensure_kebabignore_entry(dir.path()).unwrap();
let content = fs::read_to_string(dir.path().join(".kebabignore")).unwrap();
let count = content.lines().filter(|l| l.trim() == "_external/").count();
assert_eq!(count, 1, "should not duplicate");
}
#[test]
fn ensure_kebabignore_entry_handles_missing_trailing_newline() {
let dir = tempdir().unwrap();
fs::write(dir.path().join(".kebabignore"), "*.tmp").unwrap(); // no \n
ensure_kebabignore_entry(dir.path()).unwrap();
let content = fs::read_to_string(dir.path().join(".kebabignore")).unwrap();
let lines: Vec<&str> = content.lines().collect();
assert!(lines.contains(&"*.tmp"));
assert!(lines.contains(&"_external/"));
}
#[test]
fn copy_to_external_writes_with_hash_prefix_filename() {
let dir = tempdir().unwrap();
let ext_dir = ensure_external_dir(dir.path()).unwrap();
let path = copy_to_external(&ext_dir, b"hello", "md").unwrap();
assert!(path.exists());
assert!(path.file_name().unwrap().to_string_lossy().ends_with(".md"));
let stem = path.file_stem().unwrap().to_string_lossy();
assert_eq!(stem.len(), 12);
}
#[test]
fn copy_to_external_is_idempotent_for_same_bytes() {
let dir = tempdir().unwrap();
let ext_dir = ensure_external_dir(dir.path()).unwrap();
let p1 = copy_to_external(&ext_dir, b"hello", "md").unwrap();
let p2 = copy_to_external(&ext_dir, b"hello", "md").unwrap();
assert_eq!(p1, p2);
}
#[test]
fn copy_to_external_different_bytes_produce_different_filenames() {
let dir = tempdir().unwrap();
let ext_dir = ensure_external_dir(dir.path()).unwrap();
let p1 = copy_to_external(&ext_dir, b"hello", "md").unwrap();
let p2 = copy_to_external(&ext_dir, b"world", "md").unwrap();
assert_ne!(p1, p2);
}
#[test]
fn inject_frontmatter_basic() {
let out = inject_frontmatter("## Body", "Article X", None).unwrap();
assert!(out.starts_with("---\ntitle: \"Article X\"\n---\n\n## Body"));
}
#[test]
fn inject_frontmatter_with_source_uri() {
let out = inject_frontmatter("## Body", "X", Some("https://example.com/x")).unwrap();
assert!(out.contains("title: \"X\""));
assert!(out.contains("source_uri: \"https://example.com/x\""));
assert!(out.contains("\n## Body"));
}
#[test]
fn inject_frontmatter_errors_on_existing_frontmatter() {
let body = "---\ntitle: Existing\n---\n\n## Body";
let err = inject_frontmatter(body, "New", None).unwrap_err();
assert!(err.to_string().contains("already has frontmatter"));
}
#[test]
fn inject_frontmatter_errors_on_existing_frontmatter_crlf() {
let body = "---\r\ntitle: Existing\r\n---\r\n\r\n## Body";
let err = inject_frontmatter(body, "New", None).unwrap_err();
assert!(err.to_string().contains("already has frontmatter"));
}
#[test]
fn yaml_quote_escapes_quotes_and_backslashes() {
assert_eq!(yaml_quote("hello \"world\""), "\"hello \\\"world\\\"\"");
assert_eq!(yaml_quote("path\\to"), "\"path\\\\to\"");
assert_eq!(yaml_quote("line\nbreak"), "\"line\\nbreak\"");
}
}

View File

@@ -0,0 +1,447 @@
//! p9-fb-35 verbatim fetch implementation.
//!
//! [`App::fetch`] is the facade entry point. It dispatches on
//! [`FetchQuery`] variants:
//!
//! - `Chunk(id)` — return the chunk row from `chunks.text`, optionally
//! with ±N surrounding chunks (`FetchOpts::context`).
//! - `Doc(id)` — return the entire document re-serialized to markdown.
//! (Implemented in Task 4.)
//! - `Span { doc_id, line_start, line_end }` — return a contiguous line
//! slice. (Implemented in Task 5.)
//!
//! Errors are surfaced as [`StructuredError`] (anyhow-friendly wrapper
//! around `ErrorV1`) so the CLI / MCP wire layer's `classify` keeps the
//! typed `code` (`chunk_not_found` / `doc_not_found` /
//! `span_not_supported`) instead of falling through to `code =
//! "generic"`.
use anyhow::Result;
use time::OffsetDateTime;
use kebab_core::{
Block, CanonicalDocument, Chunk, ChunkId, DocumentId, DocumentStore, FetchKind, FetchOpts,
FetchQuery, FetchResult,
};
use crate::App;
use crate::error_wire::{ERROR_V1_ID, ErrorV1, StructuredError};
use crate::staleness::compute_stale;
impl App {
/// p9-fb-35: verbatim fetch facade. Returns text from
/// `chunks.text` / `CanonicalDocument` based on the requested
/// mode. Errors surface as `StructuredError(ErrorV1)` with one
/// of `chunk_not_found` / `doc_not_found` / `span_not_supported`
/// so the wire-layer classifier preserves the typed code.
pub fn fetch(&self, query: FetchQuery, opts: FetchOpts) -> Result<FetchResult> {
match query {
FetchQuery::Chunk(id) => fetch_chunk(self, id, opts),
FetchQuery::Doc(id) => fetch_doc(self, id, opts),
FetchQuery::Span {
doc_id,
line_start,
line_end,
} => fetch_span(self, doc_id, line_start, line_end, opts),
}
}
}
fn fetch_chunk(app: &App, id: ChunkId, opts: FetchOpts) -> Result<FetchResult> {
let target = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_chunk(&app.sqlite, &id)?
.ok_or_else(|| {
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "chunk_not_found".to_string(),
message: format!("chunk_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
let doc_id = target.doc_id.clone();
let doc =
<kebab_store_sqlite::SqliteStore as DocumentStore>::get_document(&app.sqlite, &doc_id)?
.ok_or_else(|| {
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "doc_not_found".to_string(),
message: format!(
"doc_id '{}' (parent of chunk '{}') not found",
doc_id.0, id.0
),
details: serde_json::Value::Null,
hint: None,
}))
})?;
let (context_before, context_after) = match opts.context {
Some(n) if n > 0 => surrounding_chunks(app, &doc_id, &id, n)?,
_ => (Vec::new(), Vec::new()),
};
let now = OffsetDateTime::now_utc();
let stale = compute_stale(
doc_metadata_updated_at(&doc),
now,
app.config.search.stale_threshold_days,
);
Ok(FetchResult {
kind: FetchKind::Chunk,
doc_id: doc.doc_id.clone(),
doc_path: doc.workspace_path.clone(),
indexed_at: doc_metadata_updated_at(&doc),
stale,
chunk: Some(target),
context_before,
context_after,
text: None,
line_start: None,
line_end: None,
effective_end: None,
truncated: false,
})
}
fn fetch_doc(app: &App, id: DocumentId, opts: FetchOpts) -> Result<FetchResult> {
let doc = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_document(&app.sqlite, &id)?
.ok_or_else(|| {
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "doc_not_found".to_string(),
message: format!("doc_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
let mut text = fmt_canonical_to_markdown(&doc);
let mut truncated = false;
if let Some(max_tokens) = opts.max_tokens {
let max_chars = max_tokens.saturating_mul(4);
if text.chars().count() > max_chars {
text = trim_to_chars(&text, max_chars);
truncated = true;
}
}
let now = OffsetDateTime::now_utc();
let stale = compute_stale(
doc_metadata_updated_at(&doc),
now,
app.config.search.stale_threshold_days,
);
Ok(FetchResult {
kind: FetchKind::Doc,
doc_id: doc.doc_id.clone(),
doc_path: doc.workspace_path.clone(),
indexed_at: doc_metadata_updated_at(&doc),
stale,
chunk: None,
context_before: Vec::new(),
context_after: Vec::new(),
text: Some(text),
line_start: None,
line_end: None,
effective_end: None,
truncated,
})
}
/// p9-fb-35: trim string to N chars (Unicode-safe). Mirrors fb-34's
/// helper at `crates/kebab-app/src/app.rs` — kept local to avoid
/// re-exporting an internal helper.
fn trim_to_chars(s: &str, n: usize) -> String {
if s.chars().count() <= n {
return s.to_string();
}
let mut out = String::with_capacity(n * 4);
for (i, c) in s.chars().enumerate() {
if i >= n {
break;
}
out.push(c);
}
out
}
fn fetch_span(
app: &App,
id: DocumentId,
line_start: u32,
line_end: u32,
opts: FetchOpts,
) -> Result<FetchResult> {
let doc = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_document(&app.sqlite, &id)?
.ok_or_else(|| {
anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "doc_not_found".to_string(),
message: format!("doc_id '{}' not found", id.0),
details: serde_json::Value::Null,
hint: None,
}))
})?;
// Reject line-incompatible media types (PDF / audio). `SourceType`
// (markdown / note / paper / reference / inbox) is the *user-facing*
// category, not the rendering format — the actual byte-level format
// lives on the source `RawAsset.media_type`. Look it up via
// workspace_path (unique key per asset).
if let Some(asset) = <kebab_store_sqlite::SqliteStore as DocumentStore>::get_asset_by_workspace_path(
&app.sqlite,
&doc.workspace_path,
)? {
if matches!(
asset.media_type,
kebab_core::MediaType::Pdf | kebab_core::MediaType::Audio(_)
) {
return Err(anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "span_not_supported".to_string(),
message: format!(
"doc '{}' has media_type {:?}; line-based span fetch unsupported. \
Use `fetch chunk` or `fetch doc` instead.",
id.0, asset.media_type
),
details: serde_json::Value::Null,
hint: Some("kind = chunk or kind = doc instead".to_string()),
})));
}
}
if line_start == 0 || line_end == 0 || line_end < line_start {
return Err(anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: ERROR_V1_ID.to_string(),
code: "invalid_input".to_string(),
message: format!(
"line_start ({line_start}) and line_end ({line_end}) must be 1-based with start <= end"
),
details: serde_json::Value::Null,
hint: None,
})));
}
let full = fmt_canonical_to_markdown(&doc);
let lines: Vec<&str> = full.lines().collect();
let total = lines.len() as u32;
// p9-fb-35 round-1 review fix: empty / out-of-range request must
// not slice. Returning empty text + `effective_end = line_start - 1`
// lets the caller detect "no lines fetched" via
// `text.is_empty() && effective_end < line_start`. `truncated`
// stays false because line-range clamp is NOT a budget event —
// budget-driven truncation is the only thing `truncated` signals.
if total == 0 || line_start > total {
let now = OffsetDateTime::now_utc();
let stale = compute_stale(
doc_metadata_updated_at(&doc),
now,
app.config.search.stale_threshold_days,
);
return Ok(FetchResult {
kind: FetchKind::Span,
doc_id: doc.doc_id.clone(),
doc_path: doc.workspace_path.clone(),
indexed_at: doc_metadata_updated_at(&doc),
stale,
chunk: None,
context_before: Vec::new(),
context_after: Vec::new(),
text: Some(String::new()),
line_start: Some(line_start),
line_end: Some(line_end),
// saturating_sub: when line_start = 1 we end at 0, signaling
// "no lines fetched" without underflowing u32.
effective_end: Some(line_start.saturating_sub(1)),
truncated: false,
});
}
let effective_end_raw = line_end.min(total);
let lo = (line_start - 1) as usize;
let hi = effective_end_raw as usize;
let mut text = lines[lo..hi].join("\n");
// p9-fb-35 round-1 review fix: `truncated` is reserved for
// budget-driven truncation only. Line-range clamp (line_end >
// total) is signaled via `effective_end < line_end`, not via
// `truncated`.
let mut truncated = false;
let mut effective_end = effective_end_raw;
if let Some(max_tokens) = opts.max_tokens {
let max_chars = max_tokens.saturating_mul(4);
if text.chars().count() > max_chars {
text = trim_to_chars(&text, max_chars);
truncated = true;
let kept = text.lines().count() as u32;
effective_end = (line_start - 1) + kept;
}
}
let now = OffsetDateTime::now_utc();
let stale = compute_stale(
doc_metadata_updated_at(&doc),
now,
app.config.search.stale_threshold_days,
);
Ok(FetchResult {
kind: FetchKind::Span,
doc_id: doc.doc_id.clone(),
doc_path: doc.workspace_path.clone(),
indexed_at: doc_metadata_updated_at(&doc),
stale,
chunk: None,
context_before: Vec::new(),
context_after: Vec::new(),
text: Some(text),
line_start: Some(line_start),
line_end: Some(line_end),
effective_end: Some(effective_end),
truncated,
})
}
/// p9-fb-35: list chunks for a document in ordinal order, return
/// `(before, after)` slices around the target chunk_id. `n` caps each
/// side independently — the worst case is `2n` total neighbors when
/// the target sits in the middle of the doc.
fn surrounding_chunks(
app: &App,
doc_id: &DocumentId,
target: &ChunkId,
n: u32,
) -> Result<(Vec<Chunk>, Vec<Chunk>)> {
let chunks = list_chunks_in_order(app, doc_id)?;
let target_idx = chunks
.iter()
.position(|c| c.chunk_id == *target)
.ok_or_else(|| anyhow::anyhow!("chunk not found in doc chunk list"))?;
let n = n as usize;
let lo = target_idx.saturating_sub(n);
let hi = target_idx
.saturating_add(n)
.saturating_add(1)
.min(chunks.len());
let before: Vec<Chunk> = chunks[lo..target_idx].to_vec();
let after: Vec<Chunk> = chunks[target_idx + 1..hi].to_vec();
Ok((before, after))
}
/// p9-fb-35: chunks have no explicit ordinal column, so the underlying
/// helper sorts by `(created_at, chunk_id)` which matches insertion
/// order produced by the chunker (deterministic). The actual SQL lives
/// inside `kebab-store-sqlite` (`SqliteStore::list_chunk_ids_for_doc`)
/// to keep the facade crate free of direct rusqlite usage.
fn list_chunks_in_order(app: &App, doc_id: &DocumentId) -> Result<Vec<Chunk>> {
let chunk_ids = app.sqlite.list_chunk_ids_for_doc(doc_id)?;
let mut out: Vec<Chunk> = Vec::with_capacity(chunk_ids.len());
for cid in chunk_ids {
if let Some(chunk) =
<kebab_store_sqlite::SqliteStore as DocumentStore>::get_chunk(&app.sqlite, &cid)?
{
out.push(chunk);
}
}
Ok(out)
}
fn doc_metadata_updated_at(doc: &CanonicalDocument) -> OffsetDateTime {
doc.metadata.updated_at
}
/// p9-fb-35: serialize a `CanonicalDocument` back to markdown. Best-
/// effort round-trip — inline-styled spans (Strong/Emph children)
/// flatten to plain text via the already-flattened `TextBlock.text`
/// field. Good enough for an agent reading verbatim context. Used by
/// Task 4 (doc mode) and Task 5 (span mode).
pub(crate) fn fmt_canonical_to_markdown(doc: &CanonicalDocument) -> String {
let mut out = String::with_capacity(1024);
for (i, block) in doc.blocks.iter().enumerate() {
if i > 0 {
out.push_str("\n\n");
}
match block {
Block::Heading(h) => {
let level = h.level.clamp(1, 6) as usize;
for _ in 0..level {
out.push('#');
}
out.push(' ');
out.push_str(&h.text);
}
Block::Paragraph(t) => out.push_str(&t.text),
Block::Quote(t) => {
// Prefix every line with `> ` so block-quote round-trips.
for (li, line) in t.text.split('\n').enumerate() {
if li > 0 {
out.push('\n');
}
out.push_str("> ");
out.push_str(line);
}
}
Block::List(l) => {
for (idx, item) in l.items.iter().enumerate() {
if idx > 0 {
out.push('\n');
}
if l.ordered {
out.push_str(&format!("{}. {}", idx + 1, item.text));
} else {
out.push_str(&format!("- {}", item.text));
}
}
}
Block::Code(c) => {
out.push_str("```");
if let Some(lang) = &c.lang {
out.push_str(lang);
}
out.push('\n');
out.push_str(&c.code);
if !c.code.ends_with('\n') {
out.push('\n');
}
out.push_str("```");
}
Block::Table(t) => {
out.push_str(&t.headers.join(" | "));
out.push('\n');
// Markdown table separator — N copies of `---|` is
// acceptable for a verbatim re-serialization (renderer
// tolerates trailing pipe).
out.push_str(&"---|".repeat(t.headers.len()));
for row in &t.rows {
out.push('\n');
out.push_str(&row.join(" | "));
}
}
Block::ImageRef(img) => {
out.push_str(&format!("![{}]({})", img.alt, img.src));
}
Block::AudioRef(_a) => {
// Canonical doc carries the transcript on AudioRefBlock,
// but markdown has no native audio embed. Emit a stub
// marker so the agent sees something ran here.
out.push_str("(audio reference)");
}
}
}
out
}
/// p9-fb-35: free-function entry for CLI / MCP. Mirrors the
/// `*_with_config` pattern documented in the kebab-app crate root —
/// `kebab-cli` calls this so a `--config <path>` flag is honored.
#[doc(hidden)]
pub fn fetch_with_config(
config: kebab_config::Config,
query: FetchQuery,
opts: FetchOpts,
) -> Result<FetchResult> {
App::open_with_config(config)?.fetch(query, opts)
}

View File

@@ -22,7 +22,7 @@ use kebab_core::IngestItemKind;
/// `p9-fb-04`, `Aborted`) events. Mirrors the fields persisted into
/// `ingest_runs.progress_json` so external tooling can reconstruct the
/// run's outcome from either side.
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq, Serialize, Deserialize)]
#[derive(Clone, Debug, Default, Eq, PartialEq, Serialize, Deserialize)]
pub struct AggregateCounts {
pub scanned: u32,
pub new: u32,
@@ -35,6 +35,8 @@ pub struct AggregateCounts {
pub errors: u32,
pub chunks_indexed: u32,
pub embeddings_indexed: u32,
/// p9-fb-25: per-extension skip count. See [`IngestReport::skipped_by_extension`].
pub skipped_by_extension: std::collections::BTreeMap<String, u32>,
}
/// One streaming progress event. The CLI's `--json` mode serializes this
@@ -94,10 +96,25 @@ pub fn media_label(media: &kebab_core::MediaType) -> &'static str {
kebab_core::MediaType::Pdf => "pdf",
kebab_core::MediaType::Image(_) => "image",
kebab_core::MediaType::Audio(_) => "audio",
kebab_core::MediaType::Code(_) => "code",
kebab_core::MediaType::Other(_) => "other",
}
}
/// p9-fb-25: render `": A docx, B txt"` breakdown after the
/// `N skipped` count when the map is non-empty. Empty → empty
/// string (no extra punctuation). desc sort by count, ties broken
/// by key alphabetic.
pub fn render_skipped_breakdown(map: &std::collections::BTreeMap<String, u32>) -> String {
if map.is_empty() {
return String::new();
}
let mut entries: Vec<_> = map.iter().collect();
entries.sort_by(|a, b| b.1.cmp(a.1).then_with(|| a.0.cmp(b.0)));
let parts: Vec<String> = entries.iter().map(|(k, v)| format!("{v} {k}")).collect();
format!(": {}", parts.join(", "))
}
/// Best-effort send into an optional `mpsc::Sender`. A dropped receiver
/// is silently absorbed — the ingest hot path must not stall on a slow
/// consumer. Logged at `trace` for diagnostics.
@@ -132,6 +149,7 @@ mod tests {
media_label(&MediaType::Audio(kebab_core::AudioType::Wav)),
"audio"
);
assert_eq!(media_label(&MediaType::Code("rust".into())), "code");
assert_eq!(media_label(&MediaType::Other("x".into())), "other");
}
@@ -192,4 +210,19 @@ mod tests {
other => panic!("unexpected event: {other:?}"),
}
}
#[test]
fn render_skipped_breakdown_desc_sort_with_tiebreak() {
use std::collections::BTreeMap;
let mut m = BTreeMap::new();
assert_eq!(render_skipped_breakdown(&m), "");
m.insert("txt".to_string(), 1);
m.insert("docx".to_string(), 2);
m.insert("epub".to_string(), 1);
// 2 docx 먼저 (count desc), 그 다음 1 epub / 1 txt 는 alphabetic.
assert_eq!(
render_skipped_breakdown(&m),
": 2 docx, 1 epub, 1 txt".to_string()
);
}
}

View File

@@ -39,45 +39,52 @@ use std::sync::Arc;
use anyhow::{Context, anyhow};
use serde::{Deserialize, Serialize};
use kebab_chunk::{MdHeadingV1Chunker, PdfPageV1Chunker};
use kebab_chunk::{CodeJsAstV1Chunker, CodePythonAstV1Chunker, CodeRustAstV1Chunker, CodeTsAstV1Chunker, MdHeadingV1Chunker, PdfPageV1Chunker};
use kebab_core::{
Answer, Block, CanonicalDocument, Chunk, ChunkId, ChunkPolicy, ChunkerVersion, Chunker,
DocFilter, DocSummary, DocumentId, DocumentStore, Embedder, EmbeddingInput,
EmbeddingKind, ExtractContext, Extractor, IngestReport, Lang, LanguageModel, MediaType,
ParserVersion, RawAsset, SearchHit, SearchQuery, SourceConnector, SourceScope,
ParserVersion, RawAsset, SearchHit, SearchQuery, SourceScope,
SourceUri, VectorRecord, VectorStore,
};
use kebab_llm_local::OllamaLanguageModel;
use kebab_normalize::build_canonical_document;
use kebab_parse_image::{ImageExtractor, OllamaVisionOcr, apply_caption, apply_ocr};
use kebab_parse_code::{JavascriptAstExtractor, PythonAstExtractor, RustAstExtractor, TypescriptAstExtractor};
use kebab_parse_pdf::PdfTextExtractor;
use kebab_parse_md::{BodyHints, parse_blocks, parse_frontmatter};
use kebab_source_fs::FsSourceConnector;
mod app;
mod bulk;
pub mod cursor;
pub mod doctor_signal;
pub mod error_signal;
pub mod error_wire;
pub mod external;
pub mod fetch;
pub mod ingest_progress;
pub mod logging;
pub mod reset;
pub mod schema;
mod staleness;
pub use app::App;
pub use ingest_progress::{AggregateCounts, IngestEvent};
pub use app::{App, SearchResponse};
pub use ingest_progress::{AggregateCounts, IngestEvent, render_skipped_breakdown};
pub use reset::{ResetReport, ResetScope};
pub use error_wire::{ERROR_V1_ID, ErrorV1, StructuredError, classify};
pub use fetch::fetch_with_config;
#[doc(hidden)]
pub use bulk::{BULK_QUERIES_MAX, bulk_search_with_config};
pub use schema::{Capabilities, Models, SCHEMA_V1_ID, SchemaV1, Stats, WireBlock, schema_with_config};
pub use staleness::{compute_stale, mark_stale_in_place};
/// Parser-version label persisted in `documents.parser_version` for
/// every Markdown file ingested through the `kb-parse-md` pipeline.
/// Kept in lock-step with the literal used in the `kb-store-sqlite`
/// idempotency / round-trip tests so the version label written by the
/// app and the one used in cross-crate fixtures match.
///
/// p9-fb-07 bumped this from `pulldown-cmark-0.x` to `md-frontmatter-v2`
/// because `kebab-normalize::derive_title` now applies a fallback chain
/// (frontmatter → H1 → H2 → first paragraph → file stem) when the
/// frontmatter title is blank. The bump invalidates `doc_id` for every
/// pre-existing Markdown document, so a re-ingest is required for the
/// new titles to land — this is the documented cascade behavior per
/// design §9.
const KEBAB_PARSE_MD_VERSION: &str = "md-frontmatter-v2";
/// p9-fb-25: sentinel for files without an extension in
/// `IngestReport.skipped_by_extension` keys + `IngestItem.warnings`
/// `unsupported media type: ...` line. Wire schema description
/// references this literal — changing the sentinel is a wire-
/// compatibility break.
pub const NO_EXT_SENTINEL: &str = "<no-ext>";
/// Caller-supplied knobs for one [`ask`] invocation.
///
@@ -85,7 +92,7 @@ const KEBAB_PARSE_MD_VERSION: &str = "md-frontmatter-v2";
/// `use kebab_app::AskOpts` keeps working without churn. The struct gained
/// a `stream_sink` field in P4-3; non-streaming callers (kb-cli today)
/// pass `stream_sink: None`.
pub use kebab_rag::AskOpts;
pub use kebab_rag::{AskOpts, StreamEvent};
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct DoctorReport {
@@ -146,6 +153,14 @@ pub fn init_workspace(force: bool) -> anyhow::Result<()> {
# — relative paths resolve against the directory of THIS
# config file, NOT the user's `cwd` at invocation time.
#
# 처리 가능한 형식 (extractor 가 자동 결정 — config 에 명시할 수 없음):
# • Markdown: .md
# • 이미지: .png .jpg .jpeg (OCR + caption)
# • PDF: .pdf
# 다른 확장자는 ingest 시 자동 skip + warning. 처리 대상 폴더의
# 일부만 ingest 하고 싶으면 `kebab ingest <path>` 로 root 명시
# 또는 `.kebabignore` 파일 / 본 `workspace.exclude` 로 denylist.
#
# Override individual keys at runtime with `KEBAB_*` env vars
# (e.g. `KEBAB_WORKSPACE_ROOT=/tmp/test kebab ingest`).
\n";
@@ -291,8 +306,8 @@ pub fn ingest_with_config_opts(
);
let connector = FsSourceConnector::new(&app.config)
.context("kb-app::ingest: build FsSourceConnector")?;
let assets = connector
.scan(&scope)
let (assets, fs_skips) = connector
.scan_with_skips(&scope)
.context("kb-app::ingest: scan workspace")?;
crate::ingest_progress::emit(
progress,
@@ -315,7 +330,7 @@ pub fn ingest_with_config_opts(
.context("kb-app::ingest: ensure Lance table")?;
}
let parser_version = ParserVersion(KEBAB_PARSE_MD_VERSION.to_string());
let parser_version = ParserVersion(kebab_parse_md::PARSER_VERSION.to_string());
let chunk_policy = chunk_policy_from_config(&app.config);
// P6-4: build OCR / caption adapters once per ingest invocation,
@@ -375,6 +390,9 @@ pub fn ingest_with_config_opts(
// without re-walking the DB.
let mut chunks_indexed: u32 = 0;
let mut embeddings_indexed: u32 = 0;
// p9-fb-25: per-extension skip count, populated in the Skipped arm below.
let mut skipped_by_extension: std::collections::BTreeMap<String, u32> =
std::collections::BTreeMap::new();
let scanned_count: u32 = u32::try_from(assets.len()).unwrap_or(u32::MAX);
let embed_active = embedder.is_some() && vector_store.is_some();
@@ -464,7 +482,9 @@ pub fn ingest_with_config_opts(
}
}
kebab_core::IngestItemKind::Skipped => {
skipped_count = skipped_count.saturating_add(1)
skipped_count = skipped_count.saturating_add(1);
let ext = ext_for_skip_warning(&item.doc_path.0);
*skipped_by_extension.entry(ext).or_insert(0) += 1;
}
kebab_core::IngestItemKind::Unchanged => {
unchanged_count = unchanged_count.saturating_add(1)
@@ -613,6 +633,7 @@ pub fn ingest_with_config_opts(
errors: error_count,
chunks_indexed,
embeddings_indexed,
skipped_by_extension: skipped_by_extension.clone(),
};
let terminal_event = if was_cancelled {
crate::ingest_progress::IngestEvent::Aborted {
@@ -654,6 +675,13 @@ pub fn ingest_with_config_opts(
unchanged: unchanged_count,
errors: error_count,
duration_ms,
skipped_by_extension,
skipped_gitignore: fs_skips.skipped_gitignore,
skipped_kebabignore: fs_skips.skipped_kebabignore,
skipped_builtin_blacklist: fs_skips.skipped_builtin_blacklist,
skipped_generated: fs_skips.skipped_generated,
skipped_size_exceeded: fs_skips.skipped_size_exceeded,
skip_examples: fs_skips.skip_examples,
items: if summary_only { None } else { Some(items) },
})
}
@@ -813,6 +841,31 @@ fn try_skip_unchanged(
}))
}
/// p9-fb-25: extract the lowercase extension (no leading dot) from a
/// workspace path for use in the `unsupported media type: .X` warning
/// and `IngestReport.skipped_by_extension` key. Returns [`NO_EXT_SENTINEL`]
/// for paths with no extension. Always lowercase so `Foo.DOCX` and
/// `bar.docx` aggregate under the same key.
fn ext_for_skip_warning(path: &str) -> String {
std::path::Path::new(path)
.extension()
.and_then(|s| s.to_str())
.map(|s| s.to_ascii_lowercase())
.unwrap_or_else(|| NO_EXT_SENTINEL.to_string())
}
/// p9-fb-25: render the `IngestItem.warnings` line for a Skipped
/// asset. [`NO_EXT_SENTINEL`] renders without a leading dot;
/// everything else gets `.ext` form.
fn unsupported_media_warning(path: &str) -> String {
let ext = ext_for_skip_warning(path);
if ext == NO_EXT_SENTINEL {
format!("unsupported media type: {NO_EXT_SENTINEL}")
} else {
format!("unsupported media type: .{ext}")
}
}
/// Process a single asset: read bytes, parse, normalize, chunk,
/// persist, embed. Per-asset failures bubble up to the caller for
/// labelling as `IngestItemKind::Error` — they do NOT abort the
@@ -865,7 +918,24 @@ fn ingest_one_asset(
force_reingest,
);
}
_ => {
// p10-1A-2 / 1B: code ingest dispatch.
MediaType::Code(lang)
if matches!(lang.as_str(), "rust" | "python" | "typescript" | "javascript") =>
{
return ingest_one_code_asset(
app,
asset,
chunk_policy,
embedder,
vector_store,
existing_doc_ids,
force_reingest,
lang.as_str(),
);
}
// p10-1A-2: non-Rust Code, Audio, and Other are not yet wired;
// skip until their respective phases.
MediaType::Code(_) | MediaType::Audio(_) | MediaType::Other(_) => {
return Ok(kebab_core::IngestItem {
kind: kebab_core::IngestItemKind::Skipped,
doc_id: None,
@@ -876,7 +946,7 @@ fn ingest_one_asset(
chunk_count: None,
parser_version: None,
chunker_version: None,
warnings: Vec::new(),
warnings: vec![unsupported_media_warning(&asset.workspace_path.0)],
error: None,
});
}
@@ -895,9 +965,7 @@ fn ingest_one_asset(
chunk_count: None,
parser_version: None,
chunker_version: None,
warnings: vec![
"kb:// source URIs are not supported by the fs ingester".into(),
],
warnings: vec!["kb:// URI not yet supported".to_string()],
error: None,
});
}
@@ -1090,7 +1158,7 @@ fn ingest_one_image_asset(
parser_version: None,
chunker_version: None,
warnings: vec![
"kb:// source URIs are not supported by the fs ingester".into(),
"kb:// URI not yet supported".to_string(),
],
error: None,
});
@@ -1425,7 +1493,7 @@ fn ingest_one_pdf_asset(
parser_version: None,
chunker_version: None,
warnings: vec![
"kb:// source URIs are not supported by the fs ingester".into(),
"kb:// URI not yet supported".to_string(),
],
error: None,
});
@@ -1567,6 +1635,213 @@ fn ingest_one_pdf_asset(
})
}
/// p10-1A-2 Task 8: process one `MediaType::Code("rust")` asset end-to-end.
///
/// Mirrors `ingest_one_pdf_asset` line-for-line with the substitutions
/// documented in the task spec:
/// - parser_version → `code-rust-v1` (via `RUST_PARSER_VERSION`)
/// - extractor → `RustAstExtractor`
/// - chunker → `CodeRustAstV1Chunker`
///
/// All other steps (incremental skip, byte read, ExtractContext, put_*,
/// embed, purge_vector_orphans) are identical to the PDF function.
#[allow(clippy::too_many_arguments)]
fn ingest_one_code_asset(
app: &App,
asset: &RawAsset,
chunk_policy: &ChunkPolicy,
embedder: Option<&Arc<dyn Embedder + Send + Sync>>,
vector_store: Option<&Arc<kebab_store_vector::LanceVectorStore>>,
existing_doc_ids: &std::collections::HashSet<String>,
force_reingest: bool,
code_lang: &str, // <-- NEW (p10-1b Task D)
) -> anyhow::Result<kebab_core::IngestItem> {
let path = match &asset.source_uri {
SourceUri::File(p) => p.clone(),
SourceUri::Kb(_) => {
return Ok(kebab_core::IngestItem {
kind: kebab_core::IngestItemKind::Skipped,
doc_id: None,
doc_path: asset.workspace_path.clone(),
asset_id: Some(asset.asset_id.clone()),
byte_len: Some(asset.byte_len),
block_count: None,
chunk_count: None,
parser_version: None,
chunker_version: None,
warnings: vec![
"kb:// URI not yet supported".to_string(),
],
error: None,
});
}
};
// p10-1b Task D/G/J: parser_version per-lang.
let parser_version = match code_lang {
"rust" => ParserVersion(kebab_parse_code::RUST_PARSER_VERSION.to_string()),
"python" => ParserVersion(kebab_parse_code::PYTHON_PARSER_VERSION.to_string()),
"typescript" => ParserVersion(kebab_parse_code::TS_PARSER_VERSION.to_string()),
"javascript" => ParserVersion(kebab_parse_code::JS_PARSER_VERSION.to_string()),
other => anyhow::bail!("unsupported code_lang: {other}"),
};
// p10-1b Task D/G/J/L: chunker_version per-lang.
let chunker_version = match code_lang {
"rust" => CodeRustAstV1Chunker.chunker_version(),
"python" => CodePythonAstV1Chunker.chunker_version(),
"typescript" => CodeTsAstV1Chunker.chunker_version(),
"javascript" => CodeJsAstV1Chunker.chunker_version(),
other => anyhow::bail!("unreachable chunker_version: {other}"),
};
if let Some(item) = try_skip_unchanged(
app,
asset,
&parser_version,
&chunker_version,
embedder.map(|e| e.model_version()).as_ref(),
force_reingest,
)? {
return Ok(item);
}
let bytes = std::fs::read(&path)
.with_context(|| format!("read code asset bytes from {}", path.display()))?;
let extract_config = kebab_core::ExtractConfig::default();
let workspace_root = app.config.resolve_workspace_root();
let ctx = ExtractContext {
asset,
workspace_root: &workspace_root,
config: &extract_config,
};
// p10-1b Task D/G/J/L: extractor per-lang.
let mut canonical = match code_lang {
"rust" => RustAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::RustAstExtractor::extract (code:rust)")?,
"python" => PythonAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::PythonAstExtractor::extract (code:python)")?,
"typescript" => TypescriptAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::TypescriptAstExtractor::extract (code:typescript)")?,
"javascript" => JavascriptAstExtractor::new()
.extract(&ctx, &bytes)
.context("kb-parse-code::JavascriptAstExtractor::extract (code:javascript)")?,
other => anyhow::bail!("unreachable (extract): {other}"),
};
// p10-1b Task D/G/J/L: chunker per-lang.
let chunks = match code_lang {
"rust" => CodeRustAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeRustAstV1Chunker::chunk (code:rust)")?,
"python" => CodePythonAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodePythonAstV1Chunker::chunk (code:python)")?,
"typescript" => CodeTsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeTsAstV1Chunker::chunk (code:typescript)")?,
"javascript" => CodeJsAstV1Chunker
.chunk(&canonical, chunk_policy)
.context("kb-chunk::CodeJsAstV1Chunker::chunk (code:javascript)")?,
other => anyhow::bail!("unreachable (chunk): {other}"),
};
// Stamp chunker + embedding versions so incremental skip detection has
// data on the second run.
canonical.last_chunker_version = Some(chunker_version.clone());
if let Some(emb) = embedder {
canonical.last_embedding_version = Some(emb.model_version());
}
purge_vector_orphans_for_workspace_path(app, asset, vector_store)?;
app.sqlite
.put_asset_with_bytes(asset, &bytes)
.context("DocumentStore::put_asset_with_bytes (code)")?;
app.sqlite
.put_document(&canonical)
.context("DocumentStore::put_document (code)")?;
app.sqlite
.put_blocks(&canonical.doc_id, &canonical.blocks)
.context("DocumentStore::put_blocks (code)")?;
app.sqlite
.put_chunks(&canonical.doc_id, &chunks)
.context("DocumentStore::put_chunks (code)")?;
if let (Some(emb), Some(vec_store)) = (embedder, vector_store)
&& !chunks.is_empty()
{
let inputs: Vec<EmbeddingInput<'_>> = chunks
.iter()
.map(|c| EmbeddingInput {
text: c.text.as_str(),
kind: EmbeddingKind::Document,
})
.collect();
let vectors = emb
.embed(&inputs)
.context("Embedder::embed (code chunks)")?;
let model_id = emb.model_id();
let model_version = emb.model_version();
let dimensions = emb.dimensions();
let records: Vec<VectorRecord> = chunks
.iter()
.zip(vectors)
.map(|(c, v)| VectorRecord {
embedding_id: kebab_core::id_for_embedding(
&c.chunk_id,
&model_id,
&model_version,
dimensions,
),
chunk_id: c.chunk_id.clone(),
vector: v,
doc_id: canonical.doc_id.clone(),
text: c.text.clone(),
heading_path: c.heading_path.clone(),
model_id: model_id.clone(),
model_version: model_version.clone(),
dimensions,
})
.collect();
vec_store
.upsert(&records)
.context("VectorStore::upsert (code)")?;
}
let kind = if existing_doc_ids.contains(&canonical.doc_id.0) {
kebab_core::IngestItemKind::Updated
} else {
kebab_core::IngestItemKind::New
};
// Surface every `Provenance::Warning` note onto `IngestItem.warnings`.
let warnings: Vec<String> = canonical
.provenance
.events
.iter()
.filter(|e| e.kind == kebab_core::ProvenanceKind::Warning)
.filter_map(|e| e.note.clone())
.collect();
Ok(kebab_core::IngestItem {
kind,
doc_id: Some(canonical.doc_id.clone()),
doc_path: asset.workspace_path.clone(),
asset_id: Some(asset.asset_id.clone()),
byte_len: Some(asset.byte_len),
block_count: u32::try_from(canonical.blocks.len()).ok(),
chunk_count: u32::try_from(chunks.len()).ok(),
parser_version: Some(canonical.parser_version.clone()),
chunker_version: Some(chunker_version),
warnings,
error: None,
})
}
/// Pull the BCP-47 language hint from the canonical document. P6-1
/// stamps `Lang("und")` by default; image-pipeline OCR / caption
/// adapters special-case "und" so the hint is intentionally dropped
@@ -1701,6 +1976,19 @@ pub fn search_uncached_with_config(
App::open_with_config(config)?.search_uncached(query)
}
/// p9-fb-34: budget-aware search free function. Mirrors
/// [`search_with_config`] but threads `SearchOpts` (max_tokens,
/// snippet_chars, cursor) and returns the [`SearchResponse`]
/// pagination wrapper. Tasks 6+8 surface this via CLI / MCP.
#[doc(hidden)]
pub fn search_with_opts_with_config(
config: kebab_config::Config,
query: kebab_core::SearchQuery,
opts: kebab_core::SearchOpts,
) -> anyhow::Result<SearchResponse> {
App::open_with_config(config)?.search_with_opts(query, opts)
}
// ── ask ──────────────────────────────────────────────────────────────────
//
// P4-3 wires `ask` end-to-end. The retriever is built per `opts.mode`;
@@ -1839,3 +2127,143 @@ pub fn doctor_with_config_path(config_path: Option<&std::path::Path>) -> anyhow:
pub fn doctor() -> anyhow::Result<DoctorReport> {
doctor_with_config_path(None)
}
/// Single-file ingest (p9-fb-31). Copies the file to
/// `<workspace.root>/_external/<blake3-12>.<ext>` and runs the
/// per-medium ingest pipeline on that single asset. Returns an
/// `IngestReport` with `scanned: 1` (and either `new: 1` or
/// `unchanged: 1` depending on whether the content hash + version
/// cascade match an existing doc — incremental ingest from p9-fb-23).
///
/// `path` may point inside or outside the workspace.
///
/// `.kebabignore` patterns matching `path` are bypassed with a stderr
/// `warn:` line — explicit ingest is intent.
#[doc(hidden)]
pub fn ingest_file_with_config(
config: kebab_config::Config,
path: &std::path::Path,
) -> anyhow::Result<IngestReport> {
if !path.exists() {
anyhow::bail!("ingest-file: source path does not exist: {}", path.display());
}
if !path.is_file() {
anyhow::bail!("ingest-file: not a regular file: {}", path.display());
}
let ext_raw = path
.extension()
.and_then(|e| e.to_str())
.ok_or_else(|| anyhow::anyhow!("ingest-file: source has no extension: {}", path.display()))?;
let ext = ext_raw.to_lowercase();
const SUPPORTED_EXTS: &[&str] = &["md", "pdf", "png", "jpg", "jpeg"];
if !SUPPORTED_EXTS.contains(&ext.as_str()) {
anyhow::bail!(
"ingest-file: unsupported extension `.{}` (supported: {:?})",
ext, SUPPORTED_EXTS
);
}
let bytes = std::fs::read(path)
.with_context(|| format!("ingest-file: read source {}", path.display()))?;
let workspace_root = config.resolve_workspace_root();
// .kebabignore check — warn but continue.
let ignore_match = check_kebabignore_match(&workspace_root, path);
if ignore_match {
eprintln!(
"warn: {} matches .kebabignore patterns; proceeding (explicit ingest bypasses ignore)",
path.display()
);
}
// Set up _external/ dir + auto-ignore line.
let external_dir = crate::external::ensure_external_dir(&workspace_root)
.context("ingest-file: ensure _external/ dir")?;
crate::external::ensure_kebabignore_entry(&workspace_root)
.context("ingest-file: append _external/ to .kebabignore")?;
// Copy bytes to _external/<hash>.<ext>.
let dest = crate::external::copy_to_external(&external_dir, &bytes, &ext)
.context("ingest-file: copy to _external")?;
// Build a SourceScope that targets _external/ with include filter
// restricting walk to the single dest filename.
let filename = dest
.file_name()
.ok_or_else(|| anyhow::anyhow!("ingest-file: dest has no filename"))?
.to_string_lossy()
.into_owned();
let scope = kebab_core::SourceScope {
root: external_dir.clone(),
include: vec![filename],
exclude: config.workspace.exclude.clone(),
};
let opts = IngestOpts::default();
ingest_with_config_opts(config, scope, /* summary_only = */ false, opts)
}
/// Stdin ingest (p9-fb-31, v1 markdown only). Prepends a YAML
/// frontmatter block (`title` + optional `source_uri`) to `body`,
/// writes the wrapped markdown to `_external/<hash12>.md`, and runs
/// `ingest_file_with_config` on the resulting file.
///
/// Errors if `body` already starts with `---` (the user should call
/// `ingest_file_with_config` directly for files that already carry
/// frontmatter).
#[doc(hidden)]
pub fn ingest_stdin_with_config(
config: kebab_config::Config,
body: &str,
title: &str,
source_uri: Option<&str>,
) -> anyhow::Result<IngestReport> {
let wrapped = crate::external::inject_frontmatter(body, title, source_uri)?;
let workspace_root = config.resolve_workspace_root();
// Note: ensure_external_dir + ensure_kebabignore_entry + copy_to_external
// are called here AND inside ingest_file_with_config. All three are
// idempotent; the redundancy is intentional — keeping stdin's wrapped
// bytes accessible by `ingest_file_with_config` requires the dest path
// to exist. The ~ms double-stat overhead is negligible at v1 scale.
let external_dir = crate::external::ensure_external_dir(&workspace_root)?;
crate::external::ensure_kebabignore_entry(&workspace_root)?;
let dest = crate::external::copy_to_external(
&external_dir,
wrapped.as_bytes(),
"md",
)?;
ingest_file_with_config(config, &dest)
}
/// Returns true if `source_path` matches any `.kebabignore` pattern
/// rooted at `workspace_root`. Used by `ingest_file_with_config` to
/// emit a stderr warn before bypassing the ignore.
fn check_kebabignore_match(workspace_root: &std::path::Path, source_path: &std::path::Path) -> bool {
let kebabignore = workspace_root.join(".kebabignore");
if !kebabignore.exists() {
return false;
}
let text = match std::fs::read_to_string(&kebabignore) {
Ok(s) => s,
Err(_) => return false,
};
let mut builder = ignore::gitignore::GitignoreBuilder::new(workspace_root);
for line in text.lines() {
let line = line.trim();
if line.is_empty() || line.starts_with('#') {
continue;
}
let _ = builder.add_line(None, line);
}
let matcher = match builder.build() {
Ok(m) => m,
Err(_) => return false,
};
matcher.matched(source_path, source_path.is_dir()).is_ignore()
}

View File

@@ -0,0 +1,246 @@
//! `kebab schema` — introspection report. See spec
//! `docs/superpowers/specs/2026-05-07-p9-fb-27-introspection-and-error-wire-design.md`.
use serde::{Deserialize, Serialize};
use kebab_config::Config;
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SchemaV1 {
pub schema_version: String,
pub kebab_version: String,
pub wire: WireBlock,
pub capabilities: Capabilities,
pub models: Models,
pub stats: Stats,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct WireBlock {
pub schemas: Vec<String>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Capabilities {
pub json_mode: bool,
pub ingest_progress: bool,
pub ingest_cancellation: bool,
pub rag_multi_turn: bool,
pub search_cache: bool,
pub incremental_ingest: bool,
pub streaming_ask: bool,
pub http_daemon: bool,
pub mcp_server: bool,
pub single_file_ingest: bool,
pub bulk_search: bool,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Models {
pub parser_version: String,
pub chunker_version: String,
pub embedding_version: String,
pub prompt_template_version: String,
pub index_version: String,
pub corpus_revision: u64,
}
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct Stats {
pub doc_count: u64,
pub chunk_count: u64,
pub asset_count: u64,
pub last_ingest_at: Option<String>,
/// p9-fb-37: per-media-kind doc count (5 keys, zero-padded).
#[serde(default)]
pub media_breakdown: std::collections::BTreeMap<String, u64>,
/// p9-fb-37: per-language doc count, NULL keyed as `"null"`.
#[serde(default)]
pub lang_breakdown: std::collections::BTreeMap<String, u64>,
/// p9-fb-37: on-disk byte sums.
#[serde(default)]
pub index_bytes: kebab_core::IndexBytes,
/// p9-fb-37: docs whose `updated_at` exceeds the staleness threshold.
#[serde(default)]
pub stale_doc_count: u64,
/// p10-1A-1: code language breakdown (chunk counts by canonical lowercase
/// language identifier). Empty until 1A-2 produces code chunks.
#[serde(default)]
pub code_lang_breakdown: std::collections::BTreeMap<String, u32>,
/// p10-1A-1: repo breakdown (chunk counts by `metadata.repo` value).
/// Empty until 1A-2 produces code chunks.
#[serde(default)]
pub repo_breakdown: std::collections::BTreeMap<String, u32>,
}
const KEBAB_VERSION: &str = env!("CARGO_PKG_VERSION");
/// Wire schema id for [`SchemaV1`]. Single source of truth — `kebab-cli`
/// re-uses this via `kebab_app::schema::SCHEMA_V1_ID` when wrapping.
pub const SCHEMA_V1_ID: &str = "schema.v1";
// Authoritative list of wire schemas this binary emits. Keep in sync with
// `docs/wire-schema/v1/*.schema.json` and `kebab-cli::wire::wire_*` helpers.
const WIRE_SCHEMAS: &[&str] = &[
"answer.v1",
"search_hit.v1",
"search_response.v1",
"doc_summary.v1",
"chunk_inspection.v1",
"doctor.v1",
"ingest_report.v1",
"ingest_progress.v1",
"reset_report.v1",
"citation.v1",
"schema.v1",
"error.v1",
"bulk_search_item.v1",
"bulk_search_response.v1",
];
/// Build a [`SchemaV1`] introspection report for the given config.
///
/// Opens the SQLite store read-only via [`kebab_store_sqlite::SqliteStore::open_existing`]
/// so the caller (kebab-cli) does not need write access to the data dir.
/// Returns a [`kebab_store_sqlite::NotIndexed`] error (wrapped in `anyhow`)
/// if the database file does not exist — the CLI translates that to an
/// `error.v1` / `"not_indexed"` wire record.
#[doc(hidden)]
pub fn schema_with_config(cfg: &Config) -> anyhow::Result<SchemaV1> {
let store = open_store_for_stats(cfg)?;
let stats = collect_stats(cfg, &store)?;
let models = collect_models(cfg, &store);
Ok(SchemaV1 {
schema_version: SCHEMA_V1_ID.to_string(),
kebab_version: KEBAB_VERSION.to_string(),
wire: WireBlock {
schemas: WIRE_SCHEMAS.iter().map(|s| (*s).to_string()).collect(),
},
capabilities: capabilities_snapshot(),
models,
stats,
})
}
fn capabilities_snapshot() -> Capabilities {
Capabilities {
json_mode: true,
ingest_progress: true,
ingest_cancellation: true,
rag_multi_turn: true,
search_cache: true,
incremental_ingest: true,
streaming_ask: false,
http_daemon: false,
mcp_server: true,
single_file_ingest: false,
bulk_search: true,
}
}
fn open_store_for_stats(cfg: &Config) -> anyhow::Result<kebab_store_sqlite::SqliteStore> {
// Mirror the data_dir resolution used in SqliteStore::open:
// kebab_config::expand_path(&cfg.storage.data_dir, "") resolves tilde
// and env vars. The SQLITE_FILE name ("kebab.sqlite") is the canonical
// file name defined in kebab-store-sqlite.
let data_dir = kebab_config::expand_path(&cfg.storage.data_dir, "");
let db_path = data_dir.join("kebab.sqlite");
kebab_store_sqlite::SqliteStore::open_existing(&db_path)
}
fn collect_stats(
cfg: &Config,
store: &kebab_store_sqlite::SqliteStore,
) -> anyhow::Result<Stats> {
let counts = store
.count_summary_with_threshold(cfg.search.stale_threshold_days as u64)?;
let data_dir = kebab_config::expand_path(&cfg.storage.data_dir, "");
let index_bytes = kebab_store_sqlite::stats_ext::index_bytes(&data_dir)
.map_err(|e| anyhow::anyhow!("index_bytes: {e}"))?;
Ok(Stats {
doc_count: counts.doc_count,
chunk_count: counts.chunk_count,
asset_count: counts.asset_count,
last_ingest_at: counts.last_ingest_at,
media_breakdown: counts.media_breakdown,
lang_breakdown: counts.lang_breakdown,
index_bytes,
stale_doc_count: counts.stale_doc_count,
// p10-1A-2: populated by the store query added in this task.
code_lang_breakdown: store.code_lang_breakdown()?,
// p10-1A-2 follow-up: dogfooding (2026-05-20) revealed this was a
// placeholder — mirror of code_lang_breakdown for the repo field.
repo_breakdown: store.repo_breakdown()?,
})
}
fn collect_models(cfg: &Config, store: &kebab_store_sqlite::SqliteStore) -> Models {
Models {
// markdown parser only — pdf-page-v1 (P7) / image extractors (P6)
// maintain their own versions; surface those when SchemaV1.models
// becomes a multi-medium map (P+).
parser_version: kebab_parse_md::PARSER_VERSION.to_string(),
chunker_version: cfg.chunking.chunker_version.clone(),
// EmbeddingModelCfg uses `.model` (not `.id`) — adapt from plan.
embedding_version: cfg.models.embedding.model.clone(),
prompt_template_version: cfg.rag.prompt_template_version.clone(),
index_version: kebab_store_vector::INDEX_VERSION_STR.to_string(),
// corpus_revision returns u64 directly (no Result) — matches
// existing impl; treat 0 as the default for a fresh/unrevised store.
corpus_revision: store.corpus_revision(),
}
}
#[cfg(test)]
mod tests_stats_ext {
use super::*;
/// p10-1A-1: Stats must serialize `code_lang_breakdown` and
/// `repo_breakdown` so downstream consumers (MCP skill, Claude Code)
/// can branch on their presence.
#[test]
fn stats_includes_code_lang_and_repo_breakdown_fields() {
let stats = Stats::default();
let v = serde_json::to_value(&stats).unwrap();
assert!(
v.get("code_lang_breakdown").is_some(),
"Stats JSON must include code_lang_breakdown: {v}"
);
assert!(
v.get("repo_breakdown").is_some(),
"Stats JSON must include repo_breakdown: {v}"
);
// Empty BTreeMap serializes as `{}` — confirm it's an object, not null.
assert!(
v["code_lang_breakdown"].is_object(),
"code_lang_breakdown must be an object: {v}"
);
assert!(
v["repo_breakdown"].is_object(),
"repo_breakdown must be an object: {v}"
);
}
#[test]
fn stats_includes_breakdowns_and_bytes_on_fresh_corpus() {
let dir = tempfile::tempdir().unwrap();
let mut cfg = kebab_config::Config::defaults();
cfg.storage.data_dir = dir.path().to_string_lossy().into_owned();
// Bring up migrations so the sqlite file is created.
let store = kebab_store_sqlite::SqliteStore::open(&cfg).unwrap();
store.run_migrations().unwrap();
drop(store);
let s = schema_with_config(&cfg).unwrap();
// 5 keys padded.
assert_eq!(s.stats.media_breakdown.len(), 5);
assert_eq!(s.stats.media_breakdown.get("markdown"), Some(&0));
assert_eq!(s.stats.media_breakdown.get("pdf"), Some(&0));
// lang map empty on empty corpus.
assert!(s.stats.lang_breakdown.is_empty());
// sqlite bytes positive after migrations, lancedb 0.
assert!(s.stats.index_bytes.sqlite > 0);
assert_eq!(s.stats.index_bytes.lancedb, 0);
assert_eq!(s.stats.stale_doc_count, 0);
}
}

View File

@@ -0,0 +1,77 @@
//! p9-fb-32 staleness helpers.
use time::{Duration, OffsetDateTime};
use kebab_core::SearchHit;
/// Returns `true` iff `now - indexed_at > threshold_days * 24h`.
/// `threshold_days = 0` always returns `false` (feature disabled).
/// Strict `>` so that exactly `threshold_days` old returns `false`.
///
/// p9-fb-32: mirrored in `kebab_rag::pipeline::compute_stale` (dep-boundary
/// rule prevents `kebab-rag → kebab-app`). Update both together.
pub fn compute_stale(
indexed_at: OffsetDateTime,
now: OffsetDateTime,
threshold_days: u32,
) -> bool {
if threshold_days == 0 {
return false;
}
let threshold = Duration::days(i64::from(threshold_days));
(now - indexed_at) > threshold
}
/// Sets `stale` on each hit in place using `compute_stale`.
pub fn mark_stale_in_place(
hits: &mut [SearchHit],
now: OffsetDateTime,
threshold_days: u32,
) {
for h in hits {
h.stale = compute_stale(h.indexed_at, now, threshold_days);
}
}
#[cfg(test)]
mod tests {
use super::*;
use time::macros::datetime;
fn now() -> OffsetDateTime {
datetime!(2026-05-09 12:00:00 UTC)
}
#[test]
fn threshold_zero_always_fresh() {
let very_old = datetime!(2020-01-01 00:00:00 UTC);
assert!(!compute_stale(very_old, now(), 0));
}
#[test]
fn just_under_threshold_is_fresh() {
// 29 days, 23h, 59m old — under 30d.
let indexed = now() - Duration::days(29) - Duration::hours(23) - Duration::minutes(59);
assert!(!compute_stale(indexed, now(), 30));
}
#[test]
fn exactly_threshold_is_fresh() {
// strict `>` boundary: exactly 30d old is still fresh.
let indexed = now() - Duration::days(30);
assert!(!compute_stale(indexed, now(), 30));
}
#[test]
fn one_minute_past_threshold_is_stale() {
let indexed = now() - Duration::days(30) - Duration::minutes(1);
assert!(compute_stale(indexed, now(), 30));
}
#[test]
fn future_indexed_at_is_fresh() {
// clock skew safety: future timestamps must not be stale.
let future = now() + Duration::hours(1);
assert!(!compute_stale(future, now(), 30));
}
}

View File

@@ -0,0 +1,431 @@
//! p10-1A-2 Task 8: smoke test for Rust code ingest dispatch.
//!
//! Writes a single `.rs` file into a TempDir workspace, ingests it via
//! `kebab_app::ingest_with_config`, then searches for the symbol name and
//! asserts that the resulting `SearchHit` carries a `Citation::Code`
//! with the expected `lang`, `symbol`, and `line_start`.
//!
//! Mirrors the `pdf_pipeline.rs` harness: lexical-only (no AVX/fastembed),
//! no OCR / caption adapters needed.
mod common;
use common::{TestEnv, lexical_query};
use kebab_core::{Citation, IngestItemKind};
/// A `.rs` file with a single `pub fn add` symbol is ingested, and a
/// lexical search for "add" must return at least one `Citation::Code`
/// hit whose `lang == "rust"`, `symbol == Some("add")`, and
/// `line_start >= 1`.
#[test]
fn rust_file_ingests_and_searches_as_code_citation() {
let env = TestEnv::lexical_only();
// Write a minimal Rust file into the workspace root.
std::fs::write(
env.workspace_root.join("demo.rs"),
"/// adds two integers\npub fn add(a: i32, b: i32) -> i32 {\n a + b\n}\n",
)
.unwrap();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no errors expected: {report:?}");
let items = report.items.as_ref().expect("items present");
let code_item = items
.iter()
.find(|i| i.doc_path.0.ends_with("demo.rs"))
.expect("demo.rs item present");
assert_eq!(
code_item.kind,
IngestItemKind::New,
"first ingest must be New: {code_item:?}"
);
assert!(
code_item.block_count.unwrap_or(0) >= 1,
"at least one block expected: {code_item:?}"
);
assert!(
code_item.chunk_count.unwrap_or(0) >= 1,
"at least one chunk expected: {code_item:?}"
);
assert_eq!(
code_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-rust-v1"),
"parser_version must be code-rust-v1"
);
assert_eq!(
code_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-rust-ast-v1"),
"chunker_version must be code-rust-ast-v1"
);
// Lexical search for the symbol name "add".
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("add"))
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'add'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("rust"),
"citation.lang must be 'rust'"
);
assert_eq!(
symbol.as_deref(),
Some("add"),
"citation.symbol must be 'add'"
);
assert!(*line_start >= 1, "line_start must be ≥1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("rust"),
"SearchHit.code_lang must be 'rust'"
);
}
/// p10-1A-2 Task 8b: a code search hit must carry `SearchHit.repo` filled
/// from the document's `Metadata.repo` (which is set by `detect_repo` during
/// ingest). `detect_repo` returns the name of the directory that contains
/// `.git/`, so we `git init` the workspace root before ingesting and then
/// assert that `h.repo == Some("workspace")`.
#[test]
fn rust_code_search_hit_has_repo() {
let env = TestEnv::lexical_only();
// `detect_repo` walks up from the file looking for `.git/`.
// Initialise a bare git repo at the workspace root so it is
// discoverable. We only need the `.git/` directory — no commits
// required.
let git_status = std::process::Command::new("git")
.args(["init", "--quiet"])
.arg(env.workspace_root.as_os_str())
.status()
.expect("git init");
assert!(git_status.success(), "git init must succeed");
std::fs::write(
env.workspace_root.join("repo_demo.rs"),
"/// multiplies two integers\npub fn mul(a: i32, b: i32) -> i32 {\n a * b\n}\n",
)
.unwrap();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert_eq!(report.errors, 0, "no ingest errors: {report:?}");
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("mul"))
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'mul'");
// The workspace root directory is named "workspace" by `TestEnv`.
let expected_repo = env
.workspace_root
.file_name()
.and_then(|n| n.to_str())
.map(str::to_owned);
assert_eq!(
h.repo,
expected_repo,
"SearchHit.repo must match the workspace dir name (detect_repo result)"
);
// Also sanity-check code_lang is still filled.
assert_eq!(
h.code_lang.as_deref(),
Some("rust"),
"SearchHit.code_lang must be 'rust'"
);
}
/// p10-1b Task G: a `.py` file in a sub-directory is ingested and the
/// resulting `Citation::Code` hit must carry `lang="python"`,
/// `symbol="kebab_eval.metrics.compute_mrr"`, and `line_start >= 1`.
/// The sub-directory (`kebab_eval/`) ensures `module_path_for_python`
/// produces a non-empty prefix so the fully-qualified symbol assertion
/// exercises the prefix wiring end-to-end.
#[test]
fn python_file_ingests_and_searches_as_code_citation() {
let env = TestEnv::lexical_only();
let module_dir = env.workspace_root.join("kebab_eval");
std::fs::create_dir_all(&module_dir).unwrap();
std::fs::write(
module_dir.join("metrics.py"),
"\"\"\"compute metrics.\"\"\"\ndef compute_mrr(scores):\n return sum(scores) / max(len(scores), 1)\n",
)
.unwrap();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert!(report.new >= 1, "python file ingested: {report:?}");
let items = report.items.as_ref().expect("items present");
let py_item = items
.iter()
.find(|i| i.doc_path.0.ends_with("metrics.py"))
.expect("metrics.py item");
assert_eq!(
py_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-python-v1"),
"parser_version must be code-python-v1"
);
assert_eq!(
py_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-python-ast-v1"),
"chunker_version must be code-python-ast-v1"
);
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("compute_mrr"))
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'compute_mrr'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("python"),
"citation.lang must be 'python'"
);
assert_eq!(
symbol.as_deref(),
Some("kebab_eval.metrics.compute_mrr"),
"citation.symbol must be 'kebab_eval.metrics.compute_mrr'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("python"),
"SearchHit.code_lang must be 'python'"
);
}
/// p10-1b Task J: a `.ts` file in a sub-directory is ingested and the
/// resulting `Citation::Code` hit must carry `lang="typescript"`,
/// `symbol="src/Foo.Foo.bar"`, and `line_start >= 1`.
/// The sub-directory (`src/`) ensures `module_path_for_tsjs` produces
/// a non-empty prefix so the fully-qualified symbol assertion exercises
/// the prefix wiring end-to-end.
#[test]
fn typescript_file_ingests_and_searches_as_code_citation() {
let env = TestEnv::lexical_only();
let src_dir = env.workspace_root.join("src");
std::fs::create_dir_all(&src_dir).unwrap();
std::fs::write(
src_dir.join("Foo.ts"),
"export class Foo {\n bar(): number { return 42; }\n}\n",
)
.unwrap();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert!(report.new >= 1, "ts file ingested: {report:?}");
let items = report.items.as_ref().expect("items present");
let ts_item = items
.iter()
.find(|i| i.doc_path.0.ends_with("Foo.ts"))
.expect("Foo.ts item");
assert_eq!(
ts_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-ts-v1"),
"parser_version must be code-ts-v1"
);
assert_eq!(
ts_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-ts-ast-v1"),
"chunker_version must be code-ts-ast-v1"
);
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("bar"))
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'bar'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("typescript"),
"citation.lang must be 'typescript'"
);
assert_eq!(
symbol.as_deref(),
Some("src/Foo.Foo.bar"),
"citation.symbol must be 'src/Foo.Foo.bar'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("typescript"),
"SearchHit.code_lang must be 'typescript'"
);
}
/// p10-1b Task L: a `.js` file in a sub-directory is ingested and the
/// resulting `Citation::Code` hit must carry `lang="javascript"`,
/// `symbol="src/Bar.Bar.baz"`, and `line_start >= 1`.
/// The sub-directory (`src/`) ensures `module_path_for_tsjs` produces
/// a non-empty prefix so the fully-qualified symbol assertion exercises
/// the prefix wiring end-to-end.
#[test]
fn javascript_file_ingests_and_searches_as_code_citation() {
let env = TestEnv::lexical_only();
let src_dir = env.workspace_root.join("src");
std::fs::create_dir_all(&src_dir).unwrap();
std::fs::write(
src_dir.join("Bar.js"),
"export class Bar {\n baz() { return 7; }\n}\n",
)
.unwrap();
let report =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false)
.expect("ingest must succeed");
assert!(report.new >= 1, "js file ingested: {report:?}");
let items = report.items.as_ref().expect("items present");
let js_item = items
.iter()
.find(|i| i.doc_path.0.ends_with("Bar.js"))
.expect("Bar.js item");
assert_eq!(
js_item.parser_version.as_ref().map(|p| p.0.as_str()),
Some("code-js-v1"),
"parser_version must be code-js-v1"
);
assert_eq!(
js_item.chunker_version.as_ref().map(|c| c.0.as_str()),
Some("code-js-ast-v1"),
"chunker_version must be code-js-ast-v1"
);
let hits = kebab_app::search_with_config(env.config.clone(), lexical_query("baz"))
.expect("search must succeed");
let h = hits
.iter()
.find(|h| matches!(&h.citation, Citation::Code { .. }))
.expect("at least one Citation::Code hit for 'baz'");
match &h.citation {
Citation::Code {
lang,
symbol,
line_start,
..
} => {
assert_eq!(
lang.as_deref(),
Some("javascript"),
"citation.lang must be 'javascript'"
);
assert_eq!(
symbol.as_deref(),
Some("src/Bar.Bar.baz"),
"citation.symbol must be 'src/Bar.Bar.baz'"
);
assert!(*line_start >= 1, "line_start must be >=1");
}
_ => unreachable!(),
}
assert_eq!(
h.code_lang.as_deref(),
Some("javascript"),
"SearchHit.code_lang must be 'javascript'"
);
}
/// Re-ingesting the same `.rs` file without changes must report
/// `Unchanged` (incremental-skip path exercised).
#[test]
fn rust_file_re_ingest_is_unchanged() {
let env = TestEnv::lexical_only();
std::fs::write(
env.workspace_root.join("stable.rs"),
"pub fn noop() {}\n",
)
.unwrap();
let r1 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let item1 = r1
.items
.as_ref()
.unwrap()
.iter()
.find(|i| i.doc_path.0.ends_with("stable.rs"))
.cloned()
.unwrap();
assert_eq!(item1.kind, IngestItemKind::New);
let r2 =
kebab_app::ingest_with_config(env.config.clone(), env.scope(), false).unwrap();
let item2 = r2
.items
.unwrap()
.into_iter()
.find(|i| i.doc_path.0.ends_with("stable.rs"))
.unwrap();
assert_eq!(
item2.kind,
IngestItemKind::Unchanged,
"identical bytes → Unchanged"
);
assert_eq!(item2.doc_id, item1.doc_id);
}

View File

@@ -75,10 +75,41 @@ impl TestEnv {
pub fn scope(&self) -> kebab_core::SourceScope {
kebab_core::SourceScope {
root: self.workspace_root.clone(),
include: self.config.workspace.include.clone(),
exclude: self.config.workspace.exclude.clone(),
..Default::default()
}
}
/// p9-fb-34 alias — tests added in fb-34 invoke `TestEnv::new()`
/// per the plan; route to the existing lexical-only constructor
/// so the lane stays AVX-free without churning all the existing
/// callers.
pub fn new() -> Self {
Self::lexical_only()
}
/// p9-fb-34: open a fresh `App` against this env's config. Used
/// by integration tests that need to call `App::search_with_opts`
/// directly. Caller can invoke this multiple times to simulate
/// re-opening the binary after a corpus revision bump.
pub fn app(&self) -> kebab_app::App {
kebab_app::App::open_with_config(self.config.clone())
.expect("App::open_with_config")
}
}
/// p9-fb-34: write `content` into the env's workspace at
/// `relative_path`, then run a full ingest so the document is
/// searchable. Mirrors the convenience helpers used by other
/// `TestEnv`-driven crates.
pub fn ingest_md(env: &TestEnv, relative_path: &str, content: &str) {
let path = env.workspace_root.join(relative_path);
if let Some(parent) = path.parent() {
std::fs::create_dir_all(parent).expect("create parent dirs");
}
std::fs::write(&path, content).expect("write workspace file");
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true)
.expect("ingest_with_config");
}
/// Test helper: build a `SearchQuery` for lexical mode at k=10. Used
@@ -94,6 +125,29 @@ pub fn lexical_query(text: &str) -> kebab_core::SearchQuery {
}
}
/// p9-fb-32: rewrite `documents.updated_at` for one workspace path
/// to `now - days_ago` (RFC3339 UTC). Used by staleness integration
/// tests to simulate aged-out docs without faking system time. Caller
/// is responsible for ingesting the doc *before* calling this — the
/// row must already exist.
pub fn backdate_document_updated_at(env: &TestEnv, workspace_path: &str, days_ago: i64) {
let backdated = (time::OffsetDateTime::now_utc() - time::Duration::days(days_ago))
.format(&time::format_description::well_known::Rfc3339)
.expect("format backdated updated_at");
let db_path = PathBuf::from(&env.config.storage.data_dir).join("kebab.sqlite");
let conn = rusqlite::Connection::open(&db_path).expect("open kebab.sqlite");
let updated = conn
.execute(
"UPDATE documents SET updated_at = ?1 WHERE workspace_path = ?2",
rusqlite::params![backdated, workspace_path],
)
.expect("UPDATE documents.updated_at");
assert_eq!(
updated, 1,
"backdate_document_updated_at: expected to update exactly 1 row for {workspace_path}, got {updated}"
);
}
fn copy_fixture_workspace(dest: &Path) {
let src = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")

View File

@@ -0,0 +1,24 @@
//! p9-fb-34: cursor encode/decode round-trip + corpus_revision mismatch.
use kebab_app::cursor;
#[test]
fn cursor_roundtrip_preserves_offset() {
let encoded = cursor::encode(5, "rev-abc");
let offset = cursor::decode(&encoded, "rev-abc").unwrap();
assert_eq!(offset, 5);
}
#[test]
fn cursor_decode_rejects_mismatched_revision() {
let encoded = cursor::encode(7, "rev-old");
let err = cursor::decode(&encoded, "rev-new").unwrap_err();
assert_eq!(err.code, "stale_cursor");
assert!(err.message.contains("rev-old") || err.message.contains("rev-new"));
}
#[test]
fn cursor_decode_rejects_garbage_input() {
let err = cursor::decode("not-base64!!!", "any").unwrap_err();
assert_eq!(err.code, "stale_cursor");
}

View File

@@ -0,0 +1,329 @@
//! p9-fb-35 App::fetch integration tests.
mod common;
use kebab_app::App;
use kebab_core::{FetchKind, FetchOpts, FetchQuery};
fn open(env: &common::TestEnv) -> App {
env.app()
}
#[test]
fn fetch_chunk_returns_target_only_when_no_context() {
let env = common::TestEnv::new();
common::ingest_md(&env, "a.md", "# Title\n\nFirst paragraph.\n\n## Section\n\nSecond.\n");
let app = open(&env);
// Find a chunk via search to obtain its id.
let q = kebab_core::SearchQuery {
text: "First".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 1,
filters: kebab_core::SearchFilters::default(),
};
let hits = app.search(q).unwrap();
let chunk_id = hits[0].chunk_id.clone();
let result = app
.fetch(FetchQuery::Chunk(chunk_id), FetchOpts::default())
.unwrap();
assert_eq!(result.kind, FetchKind::Chunk);
assert!(result.chunk.is_some(), "target chunk populated");
assert!(result.context_before.is_empty());
assert!(result.context_after.is_empty());
assert!(!result.truncated);
}
#[test]
fn fetch_chunk_with_context_returns_neighbors() {
let env = common::TestEnv::new();
let body = "# H1\n\nA1\n\n# H2\n\nA2\n\n# H3\n\nA3\n\n# H4\n\nA4\n\n# H5\n\nA5\n";
common::ingest_md(&env, "multi.md", body);
let app = env.app();
let q = kebab_core::SearchQuery {
text: "A3".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 1,
filters: kebab_core::SearchFilters::default(),
};
let hits = app.search(q).unwrap();
let chunk_id = hits[0].chunk_id.clone();
let result = app
.fetch(
FetchQuery::Chunk(chunk_id),
FetchOpts {
context: Some(2),
max_tokens: None,
},
)
.unwrap();
assert_eq!(result.kind, FetchKind::Chunk);
assert!(result.chunk.is_some());
let total = result.context_before.len() + result.context_after.len();
assert!(total >= 1, "at least one neighbor expected");
assert!(total <= 4, "context capped at +-2 ⇒ max 4 neighbors");
}
#[test]
fn fetch_chunk_unknown_id_returns_chunk_not_found() {
let env = common::TestEnv::new();
let app = env.app();
let err = app
.fetch(
FetchQuery::Chunk(kebab_core::ChunkId("nonexistent-id".to_string())),
FetchOpts::default(),
)
.unwrap_err();
let msg = err.to_string();
assert!(
msg.contains("chunk_not_found") || msg.contains("nonexistent-id"),
"expected chunk_not_found error, got: {msg}"
);
}
#[test]
fn fetch_doc_returns_serialized_markdown() {
let env = common::TestEnv::new();
let body = "# Heading One\n\nFirst paragraph.\n\n## Sub\n\nSecond.\n";
common::ingest_md(&env, "doc.md", body);
let app = env.app();
// Discover doc_id via search hit (avoids depending on list_docs API shape).
let q = kebab_core::SearchQuery {
text: "First".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 1,
filters: kebab_core::SearchFilters::default(),
};
let hits = app.search(q).unwrap();
let doc_id = hits[0].doc_id.clone();
let result = app
.fetch(FetchQuery::Doc(doc_id), FetchOpts::default())
.unwrap();
assert_eq!(result.kind, FetchKind::Doc);
let text = result.text.expect("doc text");
assert!(text.contains("Heading One"), "doc text contains heading: {text:?}");
assert!(text.contains("First paragraph"), "doc text contains body");
assert!(!result.truncated);
}
#[test]
fn fetch_doc_unknown_id_returns_doc_not_found() {
let env = common::TestEnv::new();
let app = env.app();
let err = app
.fetch(
FetchQuery::Doc(kebab_core::DocumentId("nonexistent-doc".to_string())),
FetchOpts::default(),
)
.unwrap_err();
assert!(err.to_string().contains("doc_not_found"), "got: {err}");
}
#[test]
fn fetch_doc_with_max_tokens_truncates() {
let env = common::TestEnv::new();
let p = "Lorem ipsum dolor sit amet consectetur adipiscing elit. ".repeat(20);
let body = format!("# Big\n\n{p}\n");
common::ingest_md(&env, "big.md", &body);
let app = env.app();
let q = kebab_core::SearchQuery {
text: "Lorem".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 1,
filters: kebab_core::SearchFilters::default(),
};
let hits = app.search(q).unwrap();
let doc_id = hits[0].doc_id.clone();
let result = app
.fetch(
FetchQuery::Doc(doc_id),
FetchOpts {
context: None,
max_tokens: Some(20), // ~80 chars
},
)
.unwrap();
assert!(result.truncated);
let text = result.text.expect("doc text");
assert!(text.chars().count() <= 100, "trimmed text len {}", text.chars().count());
}
#[test]
fn fetch_span_returns_line_range() {
let env = common::TestEnv::new();
// Use a list so the canonical-to-markdown roundtrip emits 5
// single-line entries joined by `\n` (paragraphs would be joined by
// `\n\n`, and CommonMark soft breaks inside one paragraph collapse to
// spaces — see crates/kebab-parse-md/src/blocks.rs `Event::SoftBreak`).
let body = "- Line one.\n- Line two.\n- Line three.\n- Line four.\n- Line five.\n";
common::ingest_md(&env, "lines.md", body);
let app = env.app();
let q = kebab_core::SearchQuery {
text: "Line".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 1,
filters: kebab_core::SearchFilters::default(),
};
let hits = app.search(q).unwrap();
let doc_id = hits[0].doc_id.clone();
let result = app
.fetch(
FetchQuery::Span {
doc_id,
line_start: 2,
line_end: 4,
},
FetchOpts::default(),
)
.unwrap();
assert_eq!(result.kind, FetchKind::Span);
let text = result.text.expect("span text");
let line_count = text.lines().count();
assert_eq!(line_count, 3, "span should be 3 lines: {text:?}");
assert_eq!(result.line_start, Some(2));
assert_eq!(result.line_end, Some(4));
assert_eq!(result.effective_end, Some(4));
assert!(!result.truncated);
}
#[test]
fn fetch_span_clamps_line_end_when_out_of_range() {
let env = common::TestEnv::new();
common::ingest_md(&env, "short.md", "Line one.\nLine two.\n");
let app = env.app();
let q = kebab_core::SearchQuery {
text: "Line".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 1,
filters: kebab_core::SearchFilters::default(),
};
let hits = app.search(q).unwrap();
let doc_id = hits[0].doc_id.clone();
let result = app
.fetch(
FetchQuery::Span {
doc_id,
line_start: 1,
line_end: 999,
},
FetchOpts::default(),
)
.unwrap();
let text = result.text.expect("span text");
let actual_lines = text.lines().count();
assert_eq!(result.effective_end, Some(actual_lines as u32));
assert!(actual_lines < 999);
}
#[test]
fn fetch_span_invalid_input_when_zero_lines() {
let env = common::TestEnv::new();
common::ingest_md(&env, "a.md", "Line one.\n");
let app = env.app();
let q = kebab_core::SearchQuery {
text: "Line".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 1,
filters: kebab_core::SearchFilters::default(),
};
let hits = app.search(q).unwrap();
let doc_id = hits[0].doc_id.clone();
let err = app
.fetch(
FetchQuery::Span {
doc_id,
line_start: 0,
line_end: 0,
},
FetchOpts::default(),
)
.unwrap_err();
assert!(err.to_string().contains("invalid_input"), "got: {err}");
}
#[test]
fn fetch_span_line_start_beyond_total_returns_empty_text() {
let env = common::TestEnv::new();
let body = "- Line one.\n- Line two.\n";
common::ingest_md(&env, "two_lines.md", body);
let app = env.app();
let q = kebab_core::SearchQuery {
text: "Line".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 1,
filters: kebab_core::SearchFilters::default(),
};
let hits = app.search(q).unwrap();
let doc_id = hits[0].doc_id.clone();
let result = app
.fetch(
FetchQuery::Span {
doc_id,
line_start: 100,
line_end: 200,
},
FetchOpts::default(),
)
.unwrap();
let text = result.text.expect("text field");
assert!(text.is_empty(), "out-of-range request returns empty text");
assert!(
!result.truncated,
"out-of-range is NOT truncated (budget-only flag)"
);
}
#[test]
fn fetch_chunk_context_at_first_chunk_clamps_lower_bound() {
let env = common::TestEnv::new();
// Multi-chunk markdown so context ±N has neighbors.
let body =
"# H1\n\nFirst chunk text body.\n\n# H2\n\nSecond chunk.\n\n# H3\n\nThird chunk.\n";
common::ingest_md(&env, "boundary.md", body);
let app = env.app();
let q = kebab_core::SearchQuery {
text: "First".to_string(),
mode: kebab_core::SearchMode::Lexical,
k: 1,
filters: kebab_core::SearchFilters::default(),
};
let hits = app.search(q).unwrap();
let chunk_id = hits[0].chunk_id.clone();
let result = app
.fetch(
FetchQuery::Chunk(chunk_id),
FetchOpts {
context: Some(2),
max_tokens: None,
},
)
.unwrap();
// p9-fb-35 R2: doc has 3 chunks; ±2 should clamp the total
// neighbor count to ≤ 2 + 1 (= excludes target).
//
// ⚠ Strict "first-chunk → context_before is empty" cannot be
// asserted here yet because chunks.ordinal column does not exist
// — `list_chunk_ids_for_doc` orders by `(created_at, chunk_id)`
// and chunk_id is a blake3 hash, so the "First chunk" content
// may land at any hash-order position within the doc. The clamp
// logic itself is correct (target_idx ± n → [0..len]); we just
// can't pin which chunk is hash-order-first. Tracked as
// follow-up: V007 chunks.ordinal migration.
let total = result.context_before.len() + result.context_after.len();
assert!(
total <= 2,
"doc with 3 chunks ±2 → at most 2 neighbors (excludes target), got {total}"
);
}

View File

@@ -33,10 +33,8 @@ fn write_red_png(root: &Path, name: &str) -> std::path::PathBuf {
fn cfg_with_image_pipeline(env: &TestEnv, mock_endpoint: &str) -> Config {
let mut cfg = env.config.clone();
// Ensure image assets are scanned.
cfg.workspace
.include
.push("**/*.png".to_string());
// p9-fb-25: workspace.include removed; extension routing is now
// handled by extractor matching alone (no config knob).
cfg.image.ocr.enabled = true;
cfg.image.ocr.endpoint = Some(mock_endpoint.to_string());
cfg.image.ocr.model = "vision-mock:1b".to_string();
@@ -261,7 +259,8 @@ async fn image_indexed_with_filename_when_ocr_and_caption_disabled() {
let env = TestEnv::lexical_only();
write_red_png(&env.workspace_root, "raw.png");
let mut cfg = env.config.clone();
cfg.workspace.include.push("**/*.png".to_string());
// p9-fb-25: workspace.include removed; extension routing is now
// handled by extractor matching alone (no config knob).
cfg.image.ocr.enabled = false;
cfg.image.caption.enabled = false;
@@ -326,7 +325,8 @@ async fn garbage_png_increments_errors_counter_exactly_once() {
)
.expect("write garbage fixture");
let mut cfg = env.config.clone();
cfg.workspace.include.push("**/*.png".to_string());
// p9-fb-25: workspace.include removed; extension routing is now
// handled by extractor matching alone (no config knob).
cfg.image.ocr.enabled = false;
cfg.image.caption.enabled = false;

View File

@@ -0,0 +1,111 @@
//! Integration: kebab_app::ingest_file_with_config copies external file
//! to _external/, ingests as single asset, idempotent on second call.
use std::fs;
use kebab_config::Config;
#[test]
fn ingest_file_copies_external_md_and_reports_new() {
let dir = tempfile::tempdir().unwrap();
let workspace = dir.path().join("notes");
let data = dir.path().join("data");
fs::create_dir_all(&workspace).unwrap();
fs::create_dir_all(&data).unwrap();
let mut cfg = Config::defaults();
cfg.workspace.root = workspace.to_string_lossy().into_owned();
cfg.storage.data_dir = data.to_string_lossy().into_owned();
cfg.models.embedding.provider = "none".to_string();
cfg.models.embedding.dimensions = 0;
// Source file outside the workspace.
let external_src = dir.path().join("source.md");
fs::write(&external_src, "# Hello\n\nbody.").unwrap();
let report = kebab_app::ingest_file_with_config(cfg.clone(), &external_src).unwrap();
assert_eq!(report.scanned, 1, "{report:?}");
assert_eq!(report.new, 1, "{report:?}");
assert_eq!(report.unchanged, 0, "{report:?}");
// _external/ dir created, file copied with hash prefix.
let ext_dir = workspace.join("_external");
assert!(ext_dir.is_dir());
let entries: Vec<_> = fs::read_dir(&ext_dir)
.unwrap()
.filter_map(|e| e.ok())
.collect();
assert_eq!(entries.len(), 1, "exactly one file in _external/");
let name = entries[0].file_name().to_string_lossy().into_owned();
assert!(name.ends_with(".md"));
// .kebabignore has _external/ line.
let ki = fs::read_to_string(workspace.join(".kebabignore")).unwrap();
assert!(ki.lines().any(|l| l.trim() == "_external/"));
}
#[test]
fn ingest_file_idempotent_on_second_call() {
let dir = tempfile::tempdir().unwrap();
let workspace = dir.path().join("notes");
let data = dir.path().join("data");
fs::create_dir_all(&workspace).unwrap();
fs::create_dir_all(&data).unwrap();
let mut cfg = Config::defaults();
cfg.workspace.root = workspace.to_string_lossy().into_owned();
cfg.storage.data_dir = data.to_string_lossy().into_owned();
cfg.models.embedding.provider = "none".to_string();
cfg.models.embedding.dimensions = 0;
let src = dir.path().join("doc.md");
fs::write(&src, "# A\n\nbody.").unwrap();
let r1 = kebab_app::ingest_file_with_config(cfg.clone(), &src).unwrap();
assert_eq!(r1.new, 1);
let r2 = kebab_app::ingest_file_with_config(cfg.clone(), &src).unwrap();
assert_eq!(r2.new, 0, "{r2:?}");
assert_eq!(r2.unchanged, 1, "{r2:?}");
}
#[test]
fn ingest_file_errors_on_missing_path() {
let dir = tempfile::tempdir().unwrap();
let workspace = dir.path().join("notes");
let data = dir.path().join("data");
fs::create_dir_all(&workspace).unwrap();
fs::create_dir_all(&data).unwrap();
let mut cfg = Config::defaults();
cfg.workspace.root = workspace.to_string_lossy().into_owned();
cfg.storage.data_dir = data.to_string_lossy().into_owned();
cfg.models.embedding.provider = "none".to_string();
cfg.models.embedding.dimensions = 0;
let nonexistent = dir.path().join("nope.md");
let err = kebab_app::ingest_file_with_config(cfg, &nonexistent).unwrap_err();
assert!(err.to_string().contains("does not exist"), "{err}");
}
#[test]
fn ingest_file_errors_on_unsupported_extension() {
let dir = tempfile::tempdir().unwrap();
let workspace = dir.path().join("notes");
let data = dir.path().join("data");
fs::create_dir_all(&workspace).unwrap();
fs::create_dir_all(&data).unwrap();
let mut cfg = Config::defaults();
cfg.workspace.root = workspace.to_string_lossy().into_owned();
cfg.storage.data_dir = data.to_string_lossy().into_owned();
cfg.models.embedding.provider = "none".to_string();
cfg.models.embedding.dimensions = 0;
let docx = dir.path().join("doc.docx");
fs::write(&docx, b"fake docx bytes").unwrap();
let err = kebab_app::ingest_file_with_config(cfg, &docx).unwrap_err();
assert!(err.to_string().contains("unsupported extension"), "{err}");
assert!(err.to_string().contains(".docx") || err.to_string().contains("docx"), "{err}");
}

View File

@@ -0,0 +1,78 @@
//! Integration: kebab_app::ingest_stdin_with_config injects frontmatter,
//! writes to _external/, ingests as single asset.
use std::fs;
use kebab_config::Config;
fn fresh_cfg(dir: &std::path::Path) -> Config {
let workspace = dir.join("notes");
let data = dir.join("data");
fs::create_dir_all(&workspace).unwrap();
fs::create_dir_all(&data).unwrap();
let mut cfg = Config::defaults();
cfg.workspace.root = workspace.to_string_lossy().into_owned();
cfg.storage.data_dir = data.to_string_lossy().into_owned();
cfg.models.embedding.provider = "none".to_string();
cfg.models.embedding.dimensions = 0;
cfg
}
#[test]
fn ingest_stdin_writes_frontmatter_and_reports_new() {
let dir = tempfile::tempdir().unwrap();
let cfg = fresh_cfg(dir.path());
let report = kebab_app::ingest_stdin_with_config(
cfg.clone(),
"## Body content\n\nMore.",
"Article X",
Some("https://example.com/x"),
).unwrap();
assert_eq!(report.new, 1, "{report:?}");
// _external/ contains exactly one .md file with frontmatter.
let ext_dir = std::path::PathBuf::from(&cfg.workspace.root).join("_external");
let entries: Vec<_> = fs::read_dir(&ext_dir).unwrap()
.filter_map(|e| e.ok())
.collect();
assert_eq!(entries.len(), 1);
let content = fs::read_to_string(entries[0].path()).unwrap();
assert!(content.starts_with("---\n"));
assert!(content.contains("title: \"Article X\""));
assert!(content.contains("source_uri: \"https://example.com/x\""));
assert!(content.contains("## Body content"));
}
#[test]
fn ingest_stdin_without_source_uri() {
let dir = tempfile::tempdir().unwrap();
let cfg = fresh_cfg(dir.path());
let report = kebab_app::ingest_stdin_with_config(
cfg.clone(),
"## Body",
"Title",
None,
).unwrap();
assert_eq!(report.new, 1);
let ext_dir = std::path::PathBuf::from(&cfg.workspace.root).join("_external");
let entries: Vec<_> = fs::read_dir(&ext_dir).unwrap()
.filter_map(|e| e.ok())
.collect();
let content = fs::read_to_string(entries[0].path()).unwrap();
assert!(content.contains("title: \"Title\""));
assert!(!content.contains("source_uri"));
}
#[test]
fn ingest_stdin_errors_on_existing_frontmatter() {
let dir = tempfile::tempdir().unwrap();
let cfg = fresh_cfg(dir.path());
let body = "---\ntitle: Already\n---\n\n## Body";
let err = kebab_app::ingest_stdin_with_config(cfg, body, "New", None).unwrap_err();
assert!(err.to_string().contains("already has frontmatter"), "{err}");
}

View File

@@ -0,0 +1,34 @@
//! p9-fb-25 task 3: `kebab init` produces config.toml with a header
//! comment listing the four supported extensions (md / png / jpg+jpeg
//! / pdf) so a user editing the config knows what's processable.
#[test]
fn init_workspace_header_lists_supported_extensions() {
let tmp = tempfile::tempdir().expect("tempdir");
// SAFETY: Rust 2024 marks set_var as unsafe — wrap in unsafe block.
// Each test sets process-wide XDG_CONFIG_HOME to point at the
// tempdir; init_workspace writes config.toml relative to it.
unsafe {
std::env::set_var("XDG_CONFIG_HOME", tmp.path());
// Same dir for data + cache to avoid touching real user paths.
std::env::set_var("XDG_DATA_HOME", tmp.path().join("data"));
std::env::set_var("XDG_CACHE_HOME", tmp.path().join("cache"));
std::env::set_var("XDG_STATE_HOME", tmp.path().join("state"));
}
kebab_app::init_workspace(true).expect("init_workspace");
let cfg_path = kebab_config::Config::xdg_config_path();
let body = std::fs::read_to_string(&cfg_path).unwrap_or_else(|e| {
panic!("read config at {}: {e}", cfg_path.display())
});
assert!(
body.contains("처리 가능한 형식"),
"header lists supported types section: body=\n{body}"
);
assert!(body.contains("Markdown: .md"), "md listed");
assert!(body.contains(".png .jpg .jpeg"), "image extensions listed");
assert!(body.contains("PDF: .pdf"), "pdf listed");
assert!(
!body.contains("workspace.include"),
"no leftover include reference"
);
}

View File

@@ -121,7 +121,8 @@ fn write_pdf(root: &Path, name: &str, bytes: &[u8]) -> std::path::PathBuf {
fn cfg_with_pdf(env: &TestEnv) -> Config {
let mut cfg = env.config.clone();
cfg.workspace.include.push("**/*.pdf".to_string());
// p9-fb-25: workspace.include removed; extension routing is now
// handled by extractor matching alone (no config knob).
// PDF ingest does not need OCR / caption / LM — leave defaults
// (ocr.enabled=false, caption.enabled=false). The image pipeline
// construction step skips both adapters.

View File

@@ -0,0 +1,123 @@
//! Integration test: kebab_app::schema_with_config returns a SchemaV1
//! that is internally consistent with a freshly-ingested TempDir KB.
use std::fs;
use kebab_config::Config;
use kebab_core::SourceScope;
fn minimal_config(data_dir: &std::path::Path, workspace_root: &std::path::Path) -> Config {
let mut config = Config::defaults();
config.workspace.root = workspace_root.to_string_lossy().into_owned();
config.workspace.exclude.clear();
config.storage.data_dir = data_dir.to_string_lossy().into_owned();
config.storage.model_dir = data_dir.join("models").to_string_lossy().into_owned();
config.models.embedding.provider = "none".to_string();
config.models.embedding.dimensions = 0;
config.chunking.target_tokens = 80;
config.chunking.overlap_tokens = 20;
config
}
fn minimal_scope(workspace_root: &std::path::Path) -> SourceScope {
SourceScope {
root: workspace_root.to_path_buf(),
include: vec![],
exclude: vec![],
}
}
#[test]
fn schema_report_reflects_freshly_ingested_kb() {
let temp = tempfile::tempdir().expect("tempdir");
let workspace_root = temp.path().join("workspace");
let data_dir = temp.path().join("data");
fs::create_dir_all(&workspace_root).unwrap();
fs::create_dir_all(&data_dir).unwrap();
fs::write(workspace_root.join("a.md"), "# A\n\nbody A.").unwrap();
fs::write(workspace_root.join("b.md"), "# B\n\nbody B.").unwrap();
let config = minimal_config(&data_dir, &workspace_root);
let _report =
kebab_app::ingest_with_config(config.clone(), minimal_scope(&workspace_root), false)
.unwrap();
let schema = kebab_app::schema_with_config(&config).unwrap();
assert!(!schema.kebab_version.is_empty());
assert!(
schema.wire.schemas.contains(&"schema.v1".to_string()),
"schema.v1 missing from wire.schemas: {:?}",
schema.wire.schemas
);
assert!(
schema.wire.schemas.contains(&"error.v1".to_string()),
"error.v1 missing from wire.schemas: {:?}",
schema.wire.schemas
);
assert!(schema.capabilities.json_mode);
assert!(!schema.capabilities.streaming_ask);
assert!(
schema.capabilities.mcp_server,
"mcp_server should be true after fb-30",
);
assert_eq!(
schema.stats.doc_count, 2,
"expected 2 docs (a.md + b.md): {:?}",
schema.stats
);
assert!(
schema.stats.last_ingest_at.is_some(),
"last_ingest_at must be set after ingest: {:?}",
schema.stats
);
assert!(
schema.stats.chunk_count >= 2,
"expected ≥2 chunks (a.md + b.md): {:?}",
schema.stats
);
assert_eq!(
schema.stats.asset_count, 2,
"expected 2 assets (a.md + b.md): {:?}",
schema.stats
);
}
#[test]
fn schema_report_on_empty_kb_has_zero_counts() {
// An empty workspace dir with no .md files: ingest_with_config scans 0
// files but still creates + migrates kebab.sqlite. This seeds the DB so
// open_existing (used inside schema_with_config) succeeds and returns
// all-zero counts.
let temp = tempfile::tempdir().expect("tempdir");
let workspace_root = temp.path().join("workspace");
let data_dir = temp.path().join("data");
fs::create_dir_all(&workspace_root).unwrap();
fs::create_dir_all(&data_dir).unwrap();
let config = minimal_config(&data_dir, &workspace_root);
// Run ingest over the empty workspace — creates kebab.sqlite, runs
// migrations, records 0 docs. schema_with_config can then open_existing.
let report =
kebab_app::ingest_with_config(config.clone(), minimal_scope(&workspace_root), false)
.unwrap();
assert_eq!(report.new, 0, "empty workspace should yield 0 new docs");
let schema = kebab_app::schema_with_config(&config).unwrap();
assert_eq!(
schema.stats.doc_count, 0,
"empty KB doc_count: {:?}",
schema.stats
);
assert_eq!(
schema.stats.chunk_count, 0,
"empty KB chunk_count: {:?}",
schema.stats
);
assert!(
schema.stats.last_ingest_at.is_none(),
"last_ingest_at must be None when no docs ingested: {:?}",
schema.stats
);
}

View File

@@ -0,0 +1,165 @@
//! p9-fb-34: App::search_with_opts integration tests.
mod common;
use kebab_app::SearchResponse;
use kebab_core::{SearchFilters, SearchMode, SearchOpts, SearchQuery};
fn lex(text: &str, k: usize) -> SearchQuery {
SearchQuery {
text: text.to_string(),
mode: SearchMode::Lexical,
k,
filters: SearchFilters::default(),
}
}
#[test]
fn search_with_opts_no_budget_matches_search() {
let env = common::TestEnv::new();
common::ingest_md(&env, "a.md", "# T\n\napples are red\n");
let app = env.app();
let baseline = app.search(lex("apples", 5)).unwrap();
let resp: SearchResponse = app
.search_with_opts(lex("apples", 5), SearchOpts::default())
.unwrap();
assert_eq!(resp.hits.len(), baseline.len());
assert!(!resp.truncated);
assert!(resp.next_cursor.is_none(), "k=5 against 1 doc → no next page");
}
#[test]
fn budget_truncates_snippets_when_below_threshold() {
let env = common::TestEnv::new();
let body: String = "rust ownership is a memory model. ".repeat(10);
common::ingest_md(&env, "a.md", &format!("# T\n\n{body}\n"));
let app = env.app();
let unrestricted = app.search(lex("rust", 5)).unwrap();
let unrestricted_chars: usize = unrestricted.iter().map(|h| h.snippet.chars().count()).sum();
let resp = app
.search_with_opts(
lex("rust", 5),
SearchOpts {
max_tokens: Some(50),
snippet_chars: None,
cursor: None,
trace: false,
},
)
.unwrap();
let limited_chars: usize = resp.hits.iter().map(|h| h.snippet.chars().count()).sum();
assert!(resp.truncated, "small budget must trip truncation");
assert!(limited_chars < unrestricted_chars, "snippet should shrink");
assert!(!resp.hits.is_empty(), "always retain ≥1 hit");
}
#[test]
fn cursor_paginates_to_next_page() {
let env = common::TestEnv::new();
for i in 0..6 {
common::ingest_md(&env, &format!("d{i}.md"), &format!("# T{i}\n\nrust topic {i}\n"));
}
let app = env.app();
let page1 = app
.search_with_opts(lex("rust", 2), SearchOpts::default())
.unwrap();
assert_eq!(page1.hits.len(), 2);
let cursor = page1.next_cursor.expect("more hits available");
let page2 = app
.search_with_opts(
lex("rust", 2),
SearchOpts {
max_tokens: None,
snippet_chars: None,
cursor: Some(cursor),
trace: false,
},
)
.unwrap();
assert_eq!(page2.hits.len(), 2);
let p1_ids: std::collections::HashSet<_> =
page1.hits.iter().map(|h| h.chunk_id.0.clone()).collect();
let p2_ids: std::collections::HashSet<_> =
page2.hits.iter().map(|h| h.chunk_id.0.clone()).collect();
assert!(p1_ids.is_disjoint(&p2_ids), "page 2 must not repeat page 1 hits");
}
#[test]
fn cursor_rejected_after_corpus_revision_bump() {
let env = common::TestEnv::new();
common::ingest_md(&env, "a.md", "# T\n\napples\n");
let app = env.app();
let page1 = app
.search_with_opts(lex("apples", 1), SearchOpts::default())
.unwrap();
// p9-fb-34 round-1 review: replaced silent `if let Some(c) = ...`
// with `.expect(...)` so a fixture regression that breaks the
// cursor-emission contract fails loudly instead of passing vacuously.
let c = page1
.next_cursor
.expect("k=1 page must emit next_cursor — fixture too small if this fails");
common::ingest_md(&env, "b.md", "# B\n\nbananas\n");
let app2 = env.app();
let result = app2.search_with_opts(
lex("apples", 1),
SearchOpts {
max_tokens: None,
snippet_chars: None,
cursor: Some(c),
trace: false,
},
);
let err = result.unwrap_err();
assert!(
err.to_string().contains("stale_cursor"),
"must surface stale_cursor: {err}"
);
}
#[test]
fn max_tokens_zero_returns_one_hit_truncated() {
// p9-fb-34 round-1 review: pin the documented "≥1 hit floor"
// contract — even with `max_tokens=0` (an absurdly tight budget)
// the budget loop must keep one hit and flip `truncated: true`.
// Fixture intentionally seeds multiple matches so step 2 of the
// budget loop (pop hits to 1) actually fires.
let env = common::TestEnv::new();
for i in 0..3 {
common::ingest_md(
&env,
&format!("d{i}.md"),
&format!("# T{i}\n\napples are red {i}\n"),
);
}
let app = env.app();
let resp = app
.search_with_opts(
lex("apples", 5),
SearchOpts {
max_tokens: Some(0),
snippet_chars: None,
cursor: None,
trace: false,
},
)
.unwrap();
assert_eq!(resp.hits.len(), 1, "max_tokens=0 collapses to 1-hit floor");
assert!(resp.truncated);
// p9-fb-34 R2: cursor IS emitted on k-pop case so the popped
// hits remain reachable.
assert!(
resp.next_cursor.is_some(),
"k-pop truncation must still emit next_cursor; popped hits at offset+returned"
);
}

View File

@@ -0,0 +1,87 @@
//! p9-fb-32: `App::search` end-to-end staleness wiring.
//!
//! `compute_stale` itself is unit-tested in `kebab_app::staleness`; this
//! file proves the post-process actually fires through the full
//! retriever stack and that the cache-hit re-stamp respects the
//! configured threshold.
//!
//! All three tests run lexical-only (no AVX, no fastembed download).
mod common;
use common::TestEnv;
fn lexical_query_owner() -> kebab_core::SearchQuery {
common::lexical_query("ownership")
}
/// Fresh ingest at default 30-day threshold → no hit can be stale.
/// `documents.updated_at` is stamped at ingest time (now), so the
/// distance to `now_utc()` is sub-second.
#[test]
fn fresh_doc_is_not_stale_with_default_threshold() {
let env = TestEnv::lexical_only();
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
let app = kebab_app::App::open_with_config(env.config.clone()).unwrap();
let hits = app.search(lexical_query_owner()).unwrap();
assert!(!hits.is_empty(), "expected ≥1 hit for 'ownership'");
assert!(
hits.iter().all(|h| !h.stale),
"freshly-ingested doc must not be stale at default 30d threshold: {:?}",
hits.iter().map(|h| (h.doc_path.0.clone(), h.stale)).collect::<Vec<_>>()
);
}
/// `stale_threshold_days = 0` disables the feature even for very old
/// `documents.updated_at`. Backdate the row to a year ago, expect
/// `stale: false` on every hit.
#[test]
fn threshold_zero_disables_staleness() {
let mut env = TestEnv::lexical_only();
env.config.search.stale_threshold_days = 0;
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
common::backdate_document_updated_at(&env, "intro.md", 365);
let app = kebab_app::App::open_with_config(env.config.clone()).unwrap();
let hits = app.search(lexical_query_owner()).unwrap();
assert!(!hits.is_empty(), "expected ≥1 hit");
assert!(
hits.iter().all(|h| !h.stale),
"threshold=0 disables staleness even for year-old docs: {:?}",
hits.iter().map(|h| (h.doc_path.0.clone(), h.stale)).collect::<Vec<_>>()
);
}
/// At a 30-day threshold, a 60-day-old `documents.updated_at` must
/// surface as stale on the matching hit. (Other hits — fresh fixtures
/// not backdated — stay fresh, so we use `any` not `all`.)
#[test]
fn old_doc_marked_stale() {
let mut env = TestEnv::lexical_only();
env.config.search.stale_threshold_days = 30;
kebab_app::ingest_with_config(env.config.clone(), env.scope(), true).unwrap();
common::backdate_document_updated_at(&env, "intro.md", 60);
let app = kebab_app::App::open_with_config(env.config.clone()).unwrap();
let hits = app.search(lexical_query_owner()).unwrap();
assert!(!hits.is_empty(), "expected ≥1 hit");
let intro_hits: Vec<&kebab_core::SearchHit> = hits
.iter()
.filter(|h| h.doc_path.0.ends_with("intro.md"))
.collect();
assert!(
!intro_hits.is_empty(),
"expected ≥1 hit on intro.md (the backdated doc)"
);
assert!(
intro_hits.iter().all(|h| h.stale),
"60-day-old intro.md must be stale at 30d threshold: {:?}",
intro_hits
.iter()
.map(|h| (h.doc_path.0.clone(), h.stale))
.collect::<Vec<_>>()
);
}

View File

@@ -0,0 +1,43 @@
//! p9-fb-25 task 5: skipped per-asset items must carry a human-readable
//! reason in `warnings`, and the report's `skipped_by_extension` must
//! aggregate by lowercase extension.
mod common;
use common::TestEnv;
#[test]
fn unsupported_extension_skip_carries_warning_and_is_aggregated() {
let env = TestEnv::lexical_only();
let workspace_root = std::path::PathBuf::from(&env.config.workspace.root);
std::fs::write(workspace_root.join("legacy.docx"), b"unsupported").unwrap();
std::fs::write(workspace_root.join("Makefile"), b"unsupported").unwrap();
let report = kebab_app::ingest_with_config(
env.config.clone(),
env.scope(),
false,
).unwrap();
let items = report.items.as_ref().expect("items array populated");
let docx_item = items
.iter()
.find(|i| i.doc_path.0.ends_with("legacy.docx"))
.expect("docx in items");
assert_eq!(docx_item.kind, kebab_core::IngestItemKind::Skipped);
assert_eq!(
docx_item.warnings,
vec!["unsupported media type: .docx".to_string()],
);
let makefile_item = items
.iter()
.find(|i| i.doc_path.0.ends_with("Makefile"))
.expect("Makefile in items");
assert_eq!(makefile_item.kind, kebab_core::IngestItemKind::Skipped);
assert_eq!(
makefile_item.warnings,
vec!["unsupported media type: <no-ext>".to_string()],
);
assert_eq!(report.skipped_by_extension.get("docx").copied(), Some(1));
assert_eq!(report.skipped_by_extension.get("<no-ext>").copied(), Some(1));
}

View File

@@ -0,0 +1,322 @@
//! `code-js-ast-v1` — maps a tree-sitter-derived JavaScript AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-js-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeJsAstV1Chunker;
impl Chunker for CodeJsAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeJsAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeJsAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-js-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit at blank-line paragraph boundaries, greedily
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.js".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-js-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("javascript".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("javascript".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("javascript".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_js_ast_v1() {
assert_eq!(CodeJsAstV1Chunker.chunker_version(),
ChunkerVersion("code-js-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "function parse() {\n // x\n}"),
("Foo.double", 5, 7, "function double() {\n //\n return 0;\n}"),
]);
let chunks = CodeJsAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-js-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!(" const x{i} = {i};")).collect::<Vec<_>>().join("\n");
let code = format!("function big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeJsAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "function parse() {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeJsAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeJsAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "function parse() {}\n")]);
let base: Vec<String> = CodeJsAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeJsAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeJsAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}

View File

@@ -0,0 +1,322 @@
//! `code-python-ast-v1` — maps a tree-sitter-derived Python AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-python-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodePythonAstV1Chunker;
impl Chunker for CodePythonAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodePythonAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodePythonAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-python-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit at blank-line paragraph boundaries, greedily
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.py".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-python-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("python".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("python".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("python".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_python_ast_v1() {
assert_eq!(CodePythonAstV1Chunker.chunker_version(),
ChunkerVersion("code-python-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "def parse():\n pass\n # x"),
("Foo.double", 5, 7, "def double():\n #\n pass"),
]);
let chunks = CodePythonAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-python-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!(" x{i} = {i}")).collect::<Vec<_>>().join("\n");
let code = format!("def big():\n{body}\n");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodePythonAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "def parse(): pass")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodePythonAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodePythonAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "def parse(): pass\n")]);
let base: Vec<String> = CodePythonAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodePythonAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodePythonAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}

View File

@@ -0,0 +1,322 @@
//! `code-rust-ast-v1` — maps a tree-sitter-derived Rust AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-rust-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeRustAstV1Chunker;
impl Chunker for CodeRustAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeRustAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeRustAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-rust-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit at blank-line paragraph boundaries, greedily
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.rs".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-rust-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("rust".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("rust".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("rust".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_rust_ast_v1() {
assert_eq!(CodeRustAstV1Chunker.chunker_version(),
ChunkerVersion("code-rust-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "pub fn parse() {}\n// x\n}"),
("Foo::double", 5, 7, "fn double() {}\n//\n}"),
]);
let chunks = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-rust-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!(" let x{i} = {i};")).collect::<Vec<_>>().join("\n");
let code = format!("pub fn big() {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "fn parse(){}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeRustAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeRustAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "fn parse(){}\n}")]);
let base: Vec<String> = CodeRustAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeRustAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeRustAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}

View File

@@ -0,0 +1,322 @@
//! `code-ts-ast-v1` — maps a tree-sitter-derived TypeScript AST
//! `CanonicalDocument` (one `Block::Code` per semantic unit, each with
//! `SourceSpan::Code`) to chunks 1:1. A unit longer than
//! `AST_CHUNK_MAX_LINES` is split into `<symbol> [part i/N]` sub-chunks
//! at blank-line paragraph boundaries (design §9.1 oversize fallback).
//!
//! tree-sitter is intentionally NOT a dependency here: AST work is
//! parser-side (`kebab-parse-code`, design §6.3). This chunker only
//! consumes the `CanonicalDocument`.
//!
//! `AST_CHUNK_MAX_LINES` is a constant matching
//! `IngestCodeCfg::default().ast_chunk_max_lines` (200). Per-medium
//! config threading needs a chunker registry (P+); same deviation
//! pattern as `pdf-page-v1`'s pinned `chunker_version`
//! (`tasks/HOTFIXES.md`).
use kebab_core::{
Block, BlockId, CanonicalDocument, Chunk, ChunkPolicy, Chunker, ChunkerVersion, DocumentId,
SourceSpan, id_for_chunk,
};
const VERSION_LABEL: &str = "code-ts-ast-v1";
const BYTES_PER_TOKEN: usize = 3;
const POLICY_HASH_HEX_LEN: usize = 16;
const AST_CHUNK_MAX_LINES: u32 = 200;
#[derive(Clone, Copy, Debug, Default)]
pub struct CodeTsAstV1Chunker;
impl Chunker for CodeTsAstV1Chunker {
fn chunker_version(&self) -> ChunkerVersion {
ChunkerVersion(VERSION_LABEL.to_string())
}
fn policy_hash(&self, policy: &ChunkPolicy) -> String {
let bytes = serde_json_canonicalizer::to_vec(policy)
.expect("canonical JSON serialization of ChunkPolicy must not fail");
let hex = blake3::hash(&bytes).to_hex().to_string();
hex[..POLICY_HASH_HEX_LEN].to_string()
}
fn chunk(
&self,
doc: &CanonicalDocument,
policy: &ChunkPolicy,
) -> anyhow::Result<Vec<Chunk>> {
for b in &doc.blocks {
let c = match b {
Block::Code(c) => c,
_ => anyhow::bail!(
"CodeTsAstV1Chunker only handles code docs (got non-Code block)"
),
};
if !matches!(c.common.source_span, SourceSpan::Code { .. }) {
anyhow::bail!(
"CodeTsAstV1Chunker only handles code docs (got non-Code source_span)"
);
}
}
let base_policy_hash = self.policy_hash(policy);
let chunker_version = self.chunker_version();
let mut out: Vec<Chunk> = Vec::new();
for b in &doc.blocks {
let cb = match b {
Block::Code(c) => c,
_ => unreachable!("validated above"),
};
let (ls, le, symbol, lang) = match &cb.common.source_span {
SourceSpan::Code { line_start, line_end, symbol, lang } => {
(*line_start, *line_end, symbol.clone(), lang.clone())
}
_ => unreachable!("validated above"),
};
let block_ids: Vec<BlockId> = vec![cb.common.block_id.clone()];
let span_lines = le.saturating_sub(ls) + 1;
if span_lines <= AST_CHUNK_MAX_LINES {
let span = SourceSpan::Code {
line_start: ls,
line_end: le,
symbol: symbol.clone(),
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
None, span, cb.code.clone(),
));
} else {
let parts = split_oversize(&cb.code);
let n = parts.len();
for (i, (off_start, off_end, text)) in parts.into_iter().enumerate() {
let part_ls = ls + off_start;
let part_le = ls + off_end;
let part_sym = symbol
.as_ref()
.map(|s| format!("{s} [part {}/{n}]", i + 1));
let span = SourceSpan::Code {
line_start: part_ls,
line_end: part_le,
symbol: part_sym,
lang: lang.clone(),
};
out.push(make_chunk(
doc, &chunker_version, &block_ids, &base_policy_hash,
Some(part_ls), span, text,
));
}
}
}
tracing::debug!(
target: "kebab-chunk",
doc_id = %doc.doc_id,
chunks = out.len(),
"code-ts-ast-v1 chunked",
);
Ok(out)
}
}
#[allow(clippy::too_many_arguments)]
fn make_chunk(
doc: &CanonicalDocument,
chunker_version: &ChunkerVersion,
block_ids: &[BlockId],
base_policy_hash: &str,
split_key: Option<u32>,
span: SourceSpan,
text: String,
) -> Chunk {
let id_hash = match split_key {
Some(k) => format!("{base_policy_hash}#L{k}"),
None => base_policy_hash.to_string(),
};
let chunk_id = id_for_chunk(&doc.doc_id, chunker_version, block_ids, &id_hash);
let token_estimate = text.len().div_ceil(BYTES_PER_TOKEN);
Chunk {
chunk_id,
doc_id: DocumentId(doc.doc_id.0.clone()),
block_ids: block_ids.to_vec(),
text,
heading_path: Vec::new(),
source_spans: vec![span],
token_estimate,
chunker_version: chunker_version.clone(),
policy_hash: base_policy_hash.to_string(),
}
}
/// Split an oversize unit at blank-line paragraph boundaries, greedily
/// gluing paragraphs until ~`AST_CHUNK_MAX_LINES` lines accumulate.
/// Returns `(line_offset_start, line_offset_end, text)` where offsets are
/// 0-based within the unit (caller adds the unit's absolute `line_start`).
fn split_oversize(code: &str) -> Vec<(u32, u32, String)> {
let lines: Vec<&str> = code.split('\n').collect();
let total = lines.len() as u32;
let mut out: Vec<(u32, u32, String)> = Vec::new();
let mut start: u32 = 0;
while start < total {
let mut end = (start + AST_CHUNK_MAX_LINES).min(total);
let floor = start + (AST_CHUNK_MAX_LINES * 4 / 5);
if end < total {
if let Some(b) = (floor.min(end)..end)
.rev()
.find(|&i| lines[i as usize].trim().is_empty())
{
end = b + 1;
}
}
let text = lines[start as usize..end as usize].join("\n");
out.push((start, end.saturating_sub(1), text));
start = end;
}
if out.is_empty() {
out.push((0, total.saturating_sub(1), code.to_string()));
}
out
}
#[cfg(test)]
mod tests {
use super::*;
use kebab_core::{
Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
SourceSpan, id_for_block, id_for_doc, AssetId, Lang, Metadata, ParserVersion, Provenance,
SourceType, TrustLevel, WorkspacePath,
};
use time::OffsetDateTime;
fn code_doc(units: &[(&str, u32, u32, &str)]) -> CanonicalDocument {
let wp = WorkspacePath("crates/x/src/a.ts".into());
let aid = AssetId("a".repeat(64));
let pv = ParserVersion("code-ts-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
let blocks = units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("typescript".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock { block_id: bid, heading_path: vec![], source_span: span },
lang: Some("typescript".into()),
code: (*code).to_string(),
})
})
.collect();
CanonicalDocument {
doc_id, source_asset_id: aid, workspace_path: wp, title: "a".into(),
lang: Lang("und".into()), blocks,
metadata: Metadata {
aliases: vec![], tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note, trust_level: TrustLevel::Primary,
user_id_alias: None, user: Default::default(),
repo: Some("kebab".into()), git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)), code_lang: Some("typescript".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv, schema_version: 1, doc_version: 1,
last_chunker_version: None, last_embedding_version: None,
}
}
fn policy() -> ChunkPolicy {
ChunkPolicy { target_tokens: 500, overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion(VERSION_LABEL.into()) }
}
#[test]
fn chunker_version_is_code_ts_ast_v1() {
assert_eq!(CodeTsAstV1Chunker.chunker_version(),
ChunkerVersion("code-ts-ast-v1".into()));
}
#[test]
fn one_chunk_per_unit_preserves_code_span() {
let doc = code_doc(&[
("parse", 1, 3, "function parse(): void {\n // x\n}"),
("Foo.double", 5, 7, "function double(): number {\n //\n return 0;\n}"),
]);
let chunks = CodeTsAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert_eq!(chunks.len(), 2);
for c in &chunks {
assert_eq!(c.source_spans.len(), 1);
assert!(matches!(c.source_spans[0], SourceSpan::Code { .. }));
assert_eq!(c.heading_path, Vec::<String>::new());
assert_eq!(c.chunker_version.0, "code-ts-ast-v1");
}
match &chunks[0].source_spans[0] {
SourceSpan::Code { symbol, line_start, line_end, .. } => {
assert_eq!(symbol.as_deref(), Some("parse"));
assert_eq!((*line_start, *line_end), (1, 3));
}
_ => unreachable!(),
}
}
#[test]
fn oversize_unit_splits_into_parts_with_unique_ids() {
let body = (0..500).map(|i| format!(" const x{i} = {i};")).collect::<Vec<_>>().join("\n");
let code = format!("function big(): void {{\n{body}\n}}");
let doc = code_doc(&[("big", 1, 502, &code)]);
let chunks = CodeTsAstV1Chunker.chunk(&doc, &policy()).unwrap();
assert!(chunks.len() >= 2, "oversize unit must split, got {}", chunks.len());
for c in &chunks {
match &c.source_spans[0] {
SourceSpan::Code { symbol, .. } => {
assert!(symbol.as_deref().unwrap().starts_with("big [part "),
"part-numbered symbol, got {symbol:?}");
}
_ => unreachable!(),
}
}
let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
let n = ids.len(); ids.sort(); ids.dedup();
assert_eq!(ids.len(), n, "chunk_ids unique across split parts");
}
#[test]
fn non_code_doc_errors() {
use kebab_core::TextBlock;
let mut doc = code_doc(&[("parse", 1, 1, "function parse(): void {}")]);
doc.blocks = vec![Block::Paragraph(TextBlock {
common: CommonBlock {
block_id: kebab_core::BlockId("b".into()),
heading_path: vec![],
source_span: SourceSpan::Line { start: 1, end: 1 },
},
text: "x".into(), inlines: vec![],
})];
let err = CodeTsAstV1Chunker.chunk(&doc, &policy()).unwrap_err();
assert!(err.to_string().contains("CodeTsAstV1Chunker"));
}
#[test]
fn deterministic_chunk_ids_1000() {
let doc = code_doc(&[("parse", 1, 2, "function parse(): void {}\n")]);
let base: Vec<String> = CodeTsAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
for _ in 0..1000 {
let again: Vec<String> = CodeTsAstV1Chunker.chunk(&doc, &policy())
.unwrap().into_iter().map(|c| c.chunk_id.0).collect();
assert_eq!(again, base);
}
}
#[test]
fn policy_hash_matches_md_heading_v1() {
let p = policy();
assert_eq!(CodeTsAstV1Chunker.policy_hash(&p),
crate::MdHeadingV1Chunker.policy_hash(&p));
}
}

View File

@@ -15,8 +15,16 @@
//! embedder, the retriever, the LLM, the RAG layer, or the UI layers.
//! It consumes `CanonicalDocument` purely through `kb-core` types.
mod code_js_ast_v1;
mod code_python_ast_v1;
mod code_rust_ast_v1;
mod code_ts_ast_v1;
mod md_heading_v1;
mod pdf_page_v1;
pub use code_js_ast_v1::CodeJsAstV1Chunker;
pub use code_python_ast_v1::CodePythonAstV1Chunker;
pub use code_rust_ast_v1::CodeRustAstV1Chunker;
pub use code_ts_ast_v1::CodeTsAstV1Chunker;
pub use md_heading_v1::MdHeadingV1Chunker;
pub use pdf_page_v1::PdfPageV1Chunker;

View File

@@ -472,6 +472,10 @@ mod tests {
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: None,
git_branch: None,
git_commit: None,
code_lang: None,
},
provenance: Provenance { events: vec![] },
parser_version: kebab_core::ParserVersion("test-parser-0".into()),

View File

@@ -347,6 +347,10 @@ mod tests {
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: None,
git_branch: None,
git_commit: None,
code_lang: None,
},
provenance: Provenance { events: vec![] },
parser_version,
@@ -512,6 +516,10 @@ mod tests {
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: None,
git_branch: None,
git_commit: None,
code_lang: None,
},
provenance: Provenance { events: vec![] },
parser_version,

View File

@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative JavaScript code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeJsAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("src/bar.js".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-js-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line function body to force split_oversize.
let big_body: String = {
let header = "function bigTransform(items) {\n";
let body: String = (0..210u32)
.map(|i| format!(" const v{i} = items[{i}] !== undefined ? items[{i}] : null;\n"))
.collect();
let footer = " return items;\n}";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. require/import block (lines 15, ≤200)
// 1. free fn `add` (lines 712, ≤200)
// 2. class `EventBus` (lines 1420, ≤200)
// 3. class `BaseHandler` (lines 2230, ≤200)
// 4. method `EventBus.emit` (lines 3238, ≤200)
// 5. method `EventBus.on` (lines 4046, ≤200)
// 6. bigTransform (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"requires",
1,
5,
"const fs = require('fs');\nconst path = require('path');\nconst { EventEmitter } = require('events');\nconst assert = require('assert');\nconst crypto = require('crypto');".to_string(),
),
(
"add",
7,
12,
"export function add(a, b) {\n if (typeof a !== 'number') throw new TypeError('a');\n if (typeof b !== 'number') throw new TypeError('b');\n const result = a + b;\n assert(isFinite(result));\n return result;\n}".to_string(),
),
(
"EventBus",
14,
20,
"class EventBus {\n constructor() {\n this._handlers = new Map();\n this._history = [];\n this._maxHistory = 100;\n this._seq = 0;\n }\n}".to_string(),
),
(
"BaseHandler",
22,
30,
"class BaseHandler {\n handle(event) {\n throw new Error('not implemented');\n }\n batchHandle(events) {\n const results = [];\n for (const ev of events) {\n results.push(this.handle(ev));\n }\n return results;\n }\n}".to_string(),
),
(
"EventBus.emit",
32,
38,
"class EventBus {\n emit(name, payload) {\n const handlers = this._handlers.get(name) ?? [];\n for (const h of handlers) {\n h(payload);\n }\n return this;\n }\n}".to_string(),
),
(
"EventBus.on",
40,
46,
"class EventBus {\n on(name, handler) {\n if (!this._handlers.has(name)) {\n this._handlers.set(name, []);\n }\n this._handlers.get(name).push(handler);\n return this;\n }\n}".to_string(),
),
("bigTransform", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("javascript".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("javascript".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "bar.js".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("javascript".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-js-ast-v1".into()),
}
}
#[test]
fn code_js_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeJsAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.js.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-js-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_js_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeJsAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeJsAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}

View File

@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative Python code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodePythonAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("kebab_eval/metrics.py".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-python-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line function body to force split_oversize.
let big_body: String = {
let header = "def big_compute(data):\n";
let body: String = (0..210u32)
.map(|i| format!(" v{i} = data[{i}] if {i} < len(data) else 0\n"))
.collect();
let footer = " return sum(data)";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. import block (lines 15, ≤200)
// 1. free fn `compute_mrr` (lines 712, ≤200)
// 2. class `MetricsCollector` (lines 1420, ≤200)
// 3. class `BaseEvaluator` (lines 2230, ≤200)
// 4. method `run` (lines 3238, ≤200)
// 5. method `report` (lines 4046, ≤200)
// 6. big_compute (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"imports",
1,
5,
"import os\nimport sys\nfrom typing import List\nfrom pathlib import Path\nfrom collections import defaultdict".to_string(),
),
(
"compute_mrr",
7,
12,
"def compute_mrr(scores):\n if not scores:\n return 0.0\n return sum(\n 1.0 / r for r in scores\n ) / len(scores)".to_string(),
),
(
"MetricsCollector",
14,
20,
"class MetricsCollector:\n def __init__(self):\n self.scores = []\n self.labels = []\n self.counts = defaultdict(int)\n self.totals = defaultdict(float)\n self.tags = []".to_string(),
),
(
"BaseEvaluator",
22,
30,
"class BaseEvaluator:\n def evaluate(self, data):\n raise NotImplementedError\n def batch_evaluate(self, items):\n results = []\n for item in items:\n results.append(self.evaluate(item))\n return results\n def name(self):\n return type(self).__name__".to_string(),
),
(
"MetricsCollector.run",
32,
38,
"class MetricsCollector:\n def run(self, inputs):\n for inp in inputs:\n score = self._score(inp)\n self.scores.append(\n score\n )".to_string(),
),
(
"MetricsCollector.report",
40,
46,
"class MetricsCollector:\n def report(self):\n return {\n 'mean': sum(self.scores) / max(len(self.scores), 1),\n 'count': len(self.scores),\n 'tags': self.tags,\n }".to_string(),
),
("big_compute", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("python".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("python".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "metrics.py".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("python".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-python-ast-v1".into()),
}
}
#[test]
fn code_python_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodePythonAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.py.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-python-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_python_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodePythonAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodePythonAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}

View File

@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative Rust code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeRustAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("crates/kebab-chunk/src/code_rust_ast_v1.rs".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-rust-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line function body to force split_oversize.
let big_body: String = {
let header = "pub fn big_fn(input: &[u8]) -> Vec<u8> {\n";
let body: String = (0..210u32)
.map(|i| format!(" let v{i} = input.get({i} as usize).copied().unwrap_or(0);\n"))
.collect();
let footer = " vec![0u8]\n}";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. top-level use+const block (lines 15, ≤200)
// 1. free fn `parse` (lines 712, ≤200)
// 2. struct `Foo` (lines 1420, ≤200)
// 3. trait `Frobable` (lines 2230, ≤200)
// 4. impl Foo::double (lines 3238, ≤200)
// 5. impl Foo::triple (lines 4046, ≤200)
// 6. big_fn (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"use+const",
1,
5,
"use std::collections::HashMap;\nuse std::fmt;\n\nconst MAX: usize = 1024;\nconst MIN: usize = 0;".to_string(),
),
(
"parse",
7,
12,
"pub fn parse(input: &str) -> Option<u32> {\n input\n .trim()\n .parse()\n .ok()\n}".to_string(),
),
(
"Foo",
14,
20,
"pub struct Foo {\n pub name: String,\n pub value: u32,\n pub tags: Vec<String>,\n pub meta: Option<String>,\n pub count: usize,\n}".to_string(),
),
(
"Frobable",
22,
30,
"pub trait Frobable {\n fn frob(&self) -> String;\n fn frob_twice(&self) -> String {\n let a = self.frob();\n let b = self.frob();\n format!(\"{a}{b}\")\n }\n fn name(&self) -> &str;\n}".to_string(),
),
(
"Foo::double",
32,
38,
"impl Foo {\n pub fn double(&self) -> u32 {\n self.value\n .checked_mul(2)\n .unwrap_or(u32::MAX)\n }\n}".to_string(),
),
(
"Foo::triple",
40,
46,
"impl Foo {\n pub fn triple(&self) -> u32 {\n self.value\n .checked_mul(3)\n .unwrap_or(u32::MAX)\n }\n}".to_string(),
),
("big_fn", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("rust".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("rust".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "code_rust_ast_v1.rs".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("rust".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-rust-ast-v1".into()),
}
}
#[test]
fn code_rust_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeRustAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-rust-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_rust_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeRustAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeRustAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}

View File

@@ -0,0 +1,221 @@
//! Snapshot test pinning the `Vec<Chunk>` JSON for a
//! representative TypeScript code `CanonicalDocument`.
//!
//! This is an integration test. `kebab-parse-code` is intentionally NOT
//! a dev-dep (design §6.3 / §8 boundary: AST extraction is parser-side).
//! The `CanonicalDocument` is built inline from hand-crafted `Block::Code`
//! units, which is the same pattern used in `code_rust_ast_v1.rs`'s
//! internal `code_doc` test helper.
//!
//! Set `UPDATE_SNAPSHOTS=1` to re-bake the baseline.
use std::path::PathBuf;
use kebab_chunk::CodeTsAstV1Chunker;
use kebab_core::{
AssetId, Block, CanonicalDocument, ChunkPolicy, Chunker, ChunkerVersion, CodeBlock, CommonBlock,
Lang, Metadata, ParserVersion, Provenance, SourceSpan, SourceType, TrustLevel, WorkspacePath,
id_for_block, id_for_doc,
};
use serde_json::Value;
use time::OffsetDateTime;
fn fixtures_dir() -> PathBuf {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("tests")
.join("fixtures")
}
fn fixed_doc() -> CanonicalDocument {
let wp = WorkspacePath("src/Foo.ts".into());
let aid = AssetId("b".repeat(64));
// Pin parser_version so doc_id / block_ids are reproducible.
let pv = ParserVersion("code-ts-v1".into());
let doc_id = id_for_doc(&wp, &aid, &pv);
// Build a >200-line method body to force split_oversize.
let big_body: String = {
let header = "export class BigProcessor {\n process(items: string[]): string[] {\n";
let body: String = (0..210u32)
.map(|i| format!(" const v{i} = items[{i}] ?? '';\n"))
.collect();
let footer = " return items;\n }\n}";
format!("{header}{body}{footer}")
};
let big_line_count = big_body.lines().count() as u32;
let big_line_end = 48 + big_line_count - 1;
// Representative units:
// 0. import block (lines 15, ≤200)
// 1. free fn `parseInput` (lines 712, ≤200)
// 2. interface `Frobable` (lines 1420, ≤200)
// 3. class `Foo` (lines 2230, ≤200)
// 4. method `Foo.double` (lines 3238, ≤200)
// 5. method `Foo.triple` (lines 4046, ≤200)
// 6. BigProcessor (>200 lines) to force split_oversize
let raw_units: Vec<(&str, u32, u32, String)> = vec![
(
"imports",
1,
5,
"import { readFileSync } from 'fs';\nimport { join } from 'path';\nimport type { Config } from './config';\nimport { Logger } from './logger';\nimport { EventEmitter } from 'events';".to_string(),
),
(
"parseInput",
7,
12,
"export function parseInput(raw: string): number | null {\n const trimmed = raw.trim();\n const n = Number(trimmed);\n if (isNaN(n)) return null;\n return n;\n}".to_string(),
),
(
"Frobable",
14,
20,
"export interface Frobable {\n frob(): string;\n frobTwice(): string;\n readonly name: string;\n readonly tags: string[];\n count: number;\n reset(): void;\n}".to_string(),
),
(
"Foo",
22,
30,
"export class Foo implements Frobable {\n constructor(\n public readonly name: string,\n public value: number,\n public tags: string[] = [],\n ) {}\n frob(): string { return this.name; }\n frobTwice(): string { return this.name.repeat(2); }\n reset(): void { this.value = 0; }\n}".to_string(),
),
(
"Foo.double",
32,
38,
"export class Foo {\n double(): number {\n const result = this.value * 2;\n if (result > Number.MAX_SAFE_INTEGER) {\n return Number.MAX_SAFE_INTEGER;\n }\n return result;\n }\n}".to_string(),
),
(
"Foo.triple",
40,
46,
"export class Foo {\n triple(): number {\n const result = this.value * 3;\n if (result > Number.MAX_SAFE_INTEGER) {\n return Number.MAX_SAFE_INTEGER;\n }\n return result;\n }\n}".to_string(),
),
("BigProcessor", 48, big_line_end, big_body),
];
let blocks: Vec<Block> = raw_units
.iter()
.enumerate()
.map(|(i, (sym, ls, le, code))| {
let span = SourceSpan::Code {
line_start: *ls,
line_end: *le,
symbol: Some((*sym).to_string()),
lang: Some("typescript".into()),
};
let bid = id_for_block(&doc_id, "code", &[], i as u32, &span);
Block::Code(CodeBlock {
common: CommonBlock {
block_id: bid,
heading_path: vec![],
source_span: span,
},
lang: Some("typescript".into()),
code: code.clone(),
})
})
.collect();
CanonicalDocument {
doc_id,
source_asset_id: aid,
workspace_path: wp,
title: "Foo.ts".into(),
lang: Lang("und".into()),
blocks,
metadata: Metadata {
aliases: vec![],
tags: vec![],
created_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
updated_at: OffsetDateTime::from_unix_timestamp(1_700_000_000).unwrap(),
source_type: SourceType::Note,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("0".repeat(40)),
code_lang: Some("typescript".into()),
},
provenance: Provenance { events: vec![] },
parser_version: pv,
schema_version: 1,
doc_version: 1,
last_chunker_version: None,
last_embedding_version: None,
}
}
fn fixed_policy() -> ChunkPolicy {
ChunkPolicy {
target_tokens: 500,
overlap_tokens: 80,
respect_markdown_headings: false,
chunker_version: ChunkerVersion("code-ts-ast-v1".into()),
}
}
#[test]
fn code_ts_ast_chunks_snapshot() {
let doc = fixed_doc();
let policy = fixed_policy();
let chunks = CodeTsAstV1Chunker.chunk(&doc, &policy).expect("chunk");
let actual = serde_json::to_value(&chunks).unwrap();
let dir = fixtures_dir();
let baseline_path = dir.join("code-sample.ts.chunks.snapshot.json");
let baseline_text = match std::fs::read_to_string(&baseline_path) {
Ok(s) => s,
Err(_) if std::env::var("UPDATE_SNAPSHOTS").is_ok() => {
std::fs::create_dir_all(&dir).unwrap();
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
return;
}
Err(e) => panic!(
"missing baseline {}; run with UPDATE_SNAPSHOTS=1 to create: {e}",
baseline_path.display()
),
};
let expected: Value = serde_json::from_str(&baseline_text).expect("baseline parses as json");
if actual != expected {
if std::env::var("UPDATE_SNAPSHOTS").is_ok() {
let pretty = serde_json::to_string_pretty(&actual).unwrap();
std::fs::write(&baseline_path, format!("{pretty}\n")).unwrap();
eprintln!("updated baseline {}", baseline_path.display());
return;
}
let pretty = serde_json::to_string_pretty(&actual).unwrap();
panic!(
"code-ts-ast-v1 chunks snapshot drift\n\
--- expected ({}) ---\n{baseline_text}\n\
--- actual ---\n{pretty}\n\
If intentional, re-run with UPDATE_SNAPSHOTS=1.",
baseline_path.display()
);
}
}
/// Determinism cross-check: re-running the same pipeline yields the same
/// chunk_ids byte-for-byte.
#[test]
fn code_ts_ast_chunks_are_deterministic() {
let policy = fixed_policy();
let baseline: Vec<String> = CodeTsAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
for _ in 0..5 {
let again: Vec<String> = CodeTsAstV1Chunker
.chunk(&fixed_doc(), &policy)
.unwrap()
.into_iter()
.map(|c| c.chunk_id.0)
.collect();
assert_eq!(again, baseline);
}
}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@@ -27,9 +27,12 @@ kebab-eval = { path = "../kebab-eval" }
# enforces the §8 boundary in its own Cargo.toml; kb-cli just
# launches it.
kebab-tui = { path = "../kebab-tui" }
# p9-fb-30: MCP stdio server. `Cmd::Mcp` delegates entirely to this crate.
kebab-mcp = { path = "../kebab-mcp" }
anyhow = { workspace = true }
serde = { workspace = true }
serde_json = { workspace = true }
clap = { version = "4", features = ["derive"] }
clap = { version = "4", features = ["derive", "env"] }
# p9-fb-02: ingest progress UI.
# - TTY 사람 모드: indicatif spinner + bar (stderr).
# - --json 모드 / non-TTY: indicatif 끄고 raw line emit.
@@ -43,3 +46,7 @@ ctrlc = "3"
[dev-dependencies]
tempfile = { workspace = true }
# p9-fb-32: backdate `documents.updated_at` in CLI integration tests
# to simulate stale docs. `time` is the formatter used by the helper.
rusqlite = { workspace = true }
time = { workspace = true }

File diff suppressed because it is too large Load Diff

View File

@@ -39,18 +39,22 @@ pub enum ProgressMode {
Json,
/// stdout reserved for the final report; stderr gets an indicatif
/// `ProgressBar` (TTY) or one short line per event (non-TTY).
Human { tty: bool },
Human { tty: bool, quiet: bool },
}
impl ProgressMode {
/// Pick the right mode from caller flags.
pub fn from_flags(json: bool) -> Self {
///
/// - `json`: `--json` flag — takes priority, returns `Json`.
/// - `quiet`: `--quiet` flag — suppresses human-readable stderr when `Human`.
/// - `plain_env`: `KEBAB_PROGRESS=plain` — forces `tty=false` even in a TTY,
/// for CI environments that emulate a TTY with a pty wrapper.
pub fn from_flags(json: bool, quiet: bool, plain_env: bool) -> Self {
if json {
Self::Json
} else {
Self::Human {
tty: std::io::stderr().is_terminal(),
}
let tty = !plain_env && std::io::stderr().is_terminal();
Self::Human { tty, quiet }
}
}
}
@@ -83,7 +87,7 @@ impl ProgressDisplay {
fn handle(&mut self, event: &IngestEvent) -> anyhow::Result<()> {
match self.mode {
ProgressMode::Json => emit_json(event),
ProgressMode::Human { tty } => self.handle_human(event, tty),
ProgressMode::Human { tty, quiet } => self.handle_human(event, tty, quiet),
}
}
@@ -96,18 +100,20 @@ impl ProgressDisplay {
/// `ScanStarted` arm and §2.4a's ordering invariant
/// (`ScanStarted` < everything else) guarantees it is `Some` by
/// the time later events arrive.
fn handle_human(&mut self, event: &IngestEvent, tty: bool) -> anyhow::Result<()> {
fn handle_human(&mut self, event: &IngestEvent, tty: bool, quiet: bool) -> anyhow::Result<()> {
match event {
IngestEvent::ScanStarted { root } => {
let bar = ProgressBar::new_spinner().with_message(format!("scanning {root}"));
bar.set_draw_target(if tty {
bar.set_draw_target(if tty && !quiet {
ProgressDrawTarget::stderr()
} else {
ProgressDrawTarget::hidden()
});
bar.enable_steady_tick(std::time::Duration::from_millis(100));
if tty && !quiet {
bar.enable_steady_tick(std::time::Duration::from_millis(100));
}
self.bar = Some(bar);
if !tty {
if !tty && !quiet {
let mut err = std::io::stderr().lock();
let _ = writeln!(err, "ingest: scanning {root}…");
}
@@ -126,7 +132,7 @@ impl ProgressDisplay {
);
bar.set_message("");
}
if !tty {
if !tty && !quiet {
let mut err = std::io::stderr().lock();
let _ = writeln!(err, "ingest: scan complete ({total} assets)");
}
@@ -138,23 +144,28 @@ impl ProgressDisplay {
media,
} => {
if let Some(bar) = self.bar.as_ref() {
bar.set_message(format!("{media} {path}"));
// One draw per file: position only. set_message() would
// trigger a second independent draw and pollute TTY scrollback.
// Filename is visible in the non-TTY plain-line path below.
bar.set_position(u64::from(idx.saturating_sub(1)));
}
if !tty {
if !tty && !quiet {
let mut err = std::io::stderr().lock();
let _ = writeln!(err, "ingest: {idx}/{total} {media} {path}");
}
}
IngestEvent::AssetFinished { idx, .. } => {
if let Some(bar) = self.bar.as_ref() {
bar.set_position(u64::from(*idx));
}
IngestEvent::AssetFinished { .. } => {
// Position is advanced in AssetStarted; bar.finish_and_clear()
// in Completed handles the final state. No per-asset bar update
// here avoids the duplicate-frame artifact in TTY scrollback.
}
IngestEvent::Completed { counts } => {
if let Some(bar) = self.bar.take() {
bar.finish_and_clear();
}
if !tty {
// Always emit summary in both TTY and non-TTY (unless quiet).
// Bug fix: previously TTY had no summary line after bar.finish_and_clear().
if !quiet {
let mut err = std::io::stderr().lock();
let _ = writeln!(
err,
@@ -175,16 +186,20 @@ impl ProgressDisplay {
counts.scanned
));
}
let mut err = std::io::stderr().lock();
let _ = writeln!(
err,
"ingest: aborted (scanned={} new={} updated={} skipped={} errors={})",
counts.scanned,
counts.new,
counts.updated,
counts.skipped,
counts.errors,
);
// Bug fix: was unconditional (fired in TTY too).
// In TTY, bar.abandon_with_message already prints the final state.
if !tty && !quiet {
let mut err = std::io::stderr().lock();
let _ = writeln!(
err,
"ingest: aborted (scanned={} new={} updated={} skipped={} errors={})",
counts.scanned,
counts.new,
counts.updated,
counts.skipped,
counts.errors,
);
}
}
}
Ok(())
@@ -216,20 +231,35 @@ mod tests {
#[test]
fn from_flags_json_takes_priority_over_tty() {
// --json forces Json regardless of TTY state.
assert_eq!(ProgressMode::from_flags(true), ProgressMode::Json);
assert_eq!(ProgressMode::from_flags(true, false, false), ProgressMode::Json);
}
#[test]
fn from_flags_human_reflects_stderr_tty() {
// We can't synthesize a TTY in tests, but we can assert the
// shape — mode is Human { tty: <something> } when --json=false.
match ProgressMode::from_flags(false) {
match ProgressMode::from_flags(false, false, false) {
ProgressMode::Human { .. } => {}
other => panic!("expected Human mode, got {other:?}"),
}
}
#[test]
fn from_flags_quiet_sets_quiet_field() {
match ProgressMode::from_flags(false, true, false) {
ProgressMode::Human { quiet: true, .. } => {}
other => panic!("expected Human{{quiet:true}}, got {other:?}"),
}
}
#[test]
fn from_flags_plain_env_forces_tty_false() {
match ProgressMode::from_flags(false, false, true) {
ProgressMode::Human { tty: false, .. } => {}
other => panic!("expected Human{{tty:false}}, got {other:?}"),
}
}
#[test]
fn now_rfc3339_parses_back() {
let s = now_rfc3339().unwrap();

View File

@@ -75,10 +75,24 @@ pub fn wire_search_hit(h: &SearchHit) -> Value {
tag_object(v, "search_hit.v1")
}
/// Wrap a list of [`SearchHit`] values as a JSON array of `search_hit.v1`
/// objects (one tag per element, per design §2.2).
pub fn wire_search_hits(hits: &[SearchHit]) -> Value {
Value::Array(hits.iter().map(wire_search_hit).collect())
/// p9-fb-34: tag a `SearchResponse` as `search_response.v1`. Wraps
/// the existing `search_hit.v1[]` array with pagination + truncation
/// metadata. Replaces the previous bare `search_hit.v1[]` top-level
/// array (`wire_search_hits`) — see HOTFIXES / fb-34 for the
/// breaking shape change.
pub fn wire_search_response(r: &kebab_app::SearchResponse) -> Value {
let mut v = serde_json::json!({
"hits": r.hits.iter().map(wire_search_hit).collect::<Vec<_>>(),
"next_cursor": r.next_cursor,
"truncated": r.truncated,
});
if let Some(trace) = &r.trace {
let trace_v = serde_json::to_value(trace).expect("SearchTrace serializes");
if let Value::Object(ref mut map) = v {
map.insert("trace".to_string(), trace_v);
}
}
tag_object(v, "search_response.v1")
}
/// Wrap an [`Answer`] as `answer.v1`.
@@ -87,6 +101,25 @@ pub fn wire_answer(a: &Answer) -> Value {
tag_object(v, "answer.v1")
}
/// p9-fb-33: tag a [`StreamEvent`] as `answer_event.v1` ndjson.
///
/// The timestamp is added at emit time (caller fills `ts`), since the
/// pipeline doesn't carry one in the in-process enum — mirrors the
/// `wire_ingest_progress` pattern (§2 ingest_progress.v1).
pub fn wire_answer_event(
ev: &kebab_app::StreamEvent,
ts: time::OffsetDateTime,
) -> Value {
let mut v = serde_json::to_value(ev).expect("StreamEvent serializes");
let ts_str = ts
.format(&time::format_description::well_known::Rfc3339)
.expect("OffsetDateTime formats as RFC3339");
if let Value::Object(ref mut map) = v {
map.insert("ts".to_string(), Value::String(ts_str));
}
tag_object(v, "answer_event.v1")
}
/// Idempotent pass-through for [`DoctorReport`] — the type already carries
/// `schema_version: "doctor.v1"` (struct-field convention, the one
/// exception called out in the module doc above). This helper exists so
@@ -133,6 +166,55 @@ pub fn wire_ingest_progress(
Ok(tag_object(v, "ingest_progress.v1"))
}
/// Wrap a [`kebab_app::SchemaV1`] as `schema.v1`.
///
/// Uses the idempotent re-tag pattern (mirrors `wire_doctor`) because
/// `SchemaV1` already carries `schema_version: "schema.v1"` as a struct
/// field. The re-tag is a defensive no-op when the field is present; it
/// stamps the correct version if a future refactor ever drops the field.
pub fn wire_schema(s: &kebab_app::SchemaV1) -> Value {
let v = serde_json::to_value(s).expect("SchemaV1 serializes");
if let Value::Object(ref map) = v {
if matches!(
map.get("schema_version"),
Some(Value::String(s)) if s == kebab_app::SCHEMA_V1_ID
) {
return v;
}
}
tag_object(v, kebab_app::SCHEMA_V1_ID)
}
/// Wrap an [`kebab_app::ErrorV1`] as `error.v1`.
///
/// Uses the simple `tag_object` pattern because `ErrorV1` is a
/// type that does NOT carry `schema_version` itself
/// (kebab-core convention).
pub fn wire_error_v1(e: &kebab_app::ErrorV1) -> Value {
let v = serde_json::to_value(e).expect("ErrorV1 serializes");
tag_object(v, "error.v1")
}
/// p9-fb-35: tag a [`kebab_core::FetchResult`] as `fetch_result.v1`.
pub fn wire_fetch_result(r: &kebab_core::FetchResult) -> Value {
let v = serde_json::to_value(r).expect("FetchResult serializes");
tag_object(v, "fetch_result.v1")
}
/// p9-fb-42: tag a `BulkSearchItem` (already serialized as a Value)
/// as `bulk_search_item.v1`. The inner `query` / `response` / `error`
/// fields stay verbatim — only the envelope gets the schema_version stamp.
pub fn wire_bulk_search_item(item: &kebab_core::BulkSearchItem) -> Value {
let mut v = serde_json::to_value(item).expect("BulkSearchItem serializes");
if let Value::Object(ref mut map) = v {
map.insert(
"schema_version".to_string(),
Value::String("bulk_search_item.v1".to_string()),
);
}
v
}
#[cfg(test)]
mod tests {
use super::*;
@@ -157,7 +239,7 @@ mod tests {
#[test]
fn ingest_wrapper_tags_schema_version() {
use kebab_core::SourceScope;
use kebab_core::{SkipExamples, SourceScope};
let r = IngestReport {
scope: SourceScope {
root: std::path::PathBuf::from("/tmp"),
@@ -171,6 +253,13 @@ mod tests {
unchanged: 0,
errors: 0,
duration_ms: 0,
skipped_by_extension: std::collections::BTreeMap::new(),
skipped_gitignore: 0,
skipped_kebabignore: 0,
skipped_builtin_blacklist: 0,
skipped_generated: 0,
skipped_size_exceeded: 0,
skip_examples: SkipExamples::default(),
items: None,
};
let v = wire_ingest(&r);
@@ -185,13 +274,6 @@ mod tests {
assert_eq!(v.as_array().unwrap().len(), 0);
}
#[test]
fn search_hits_wraps_each_element() {
let v = wire_search_hits(&[]);
assert!(v.is_array());
assert_eq!(v.as_array().unwrap().len(), 0);
}
#[test]
fn tag_object_inserts_into_object() {
let v = Value::Object(serde_json::Map::new());
@@ -199,6 +281,83 @@ mod tests {
assert_eq!(schema_of(&tagged), Some("x.v1"));
}
#[test]
fn search_response_carries_pagination_metadata() {
// p9-fb-34: empty-hits SearchResponse round-trips through the
// wrapper with its `next_cursor` + `truncated` fields preserved
// and the top-level `schema_version` set to `search_response.v1`.
let r = kebab_app::SearchResponse {
hits: vec![],
next_cursor: Some("opaque-cursor-abc".to_string()),
truncated: true,
trace: None,
};
let v = wire_search_response(&r);
assert_eq!(schema_of(&v), Some("search_response.v1"));
assert!(v.get("hits").and_then(|h| h.as_array()).is_some());
assert_eq!(
v.get("hits").and_then(|h| h.as_array()).unwrap().len(),
0
);
assert_eq!(
v.get("next_cursor").and_then(|c| c.as_str()),
Some("opaque-cursor-abc")
);
assert_eq!(v.get("truncated").and_then(|t| t.as_bool()), Some(true));
}
#[test]
fn schema_wrapper_tags_schema_version() {
use kebab_app::{Capabilities, Models, SchemaV1, Stats, WireBlock};
let schema = SchemaV1 {
schema_version: "schema.v1".to_string(),
kebab_version: "0.2.1".to_string(),
wire: WireBlock { schemas: vec!["answer.v1".to_string()] },
capabilities: Capabilities {
json_mode: true, ingest_progress: true, ingest_cancellation: true,
rag_multi_turn: true, search_cache: true, incremental_ingest: true,
streaming_ask: false, http_daemon: false, mcp_server: false,
single_file_ingest: false, bulk_search: true,
},
models: Models {
parser_version: "x".to_string(),
chunker_version: "y".to_string(),
embedding_version: "z".to_string(),
prompt_template_version: "w".to_string(),
index_version: "v".to_string(),
corpus_revision: 7,
},
stats: Stats {
doc_count: 1, chunk_count: 2, asset_count: 1,
last_ingest_at: None,
media_breakdown: Default::default(),
lang_breakdown: Default::default(),
index_bytes: Default::default(),
stale_doc_count: 0,
// p10-1A-1: new fields added to Stats; use Default for the test fixture.
..Default::default()
},
};
let v = wire_schema(&schema);
assert_eq!(schema_of(&v), Some("schema.v1"));
assert_eq!(v.get("kebab_version").and_then(Value::as_str), Some("0.2.1"));
}
#[test]
fn error_wrapper_tags_schema_version_and_emits_code() {
use kebab_app::ErrorV1;
let err = ErrorV1 {
schema_version: "error.v1".to_string(),
code: "config_invalid".to_string(),
message: "bad config".to_string(),
details: serde_json::json!({"path": "/tmp/x"}),
hint: Some("check the path".to_string()),
};
let v = wire_error_v1(&err);
assert_eq!(schema_of(&v), Some("error.v1"));
assert_eq!(v.get("code").and_then(Value::as_str), Some("config_invalid"));
}
#[test]
fn reset_wrapper_tags_schema_version_and_serializes_scope() {
let r = kebab_app::ResetReport {
@@ -217,4 +376,49 @@ mod tests {
assert_eq!(paths.len(), 1);
assert_eq!(paths[0].as_str(), Some("/tmp/x"));
}
#[test]
fn search_response_with_trace_serializes_trace_field() {
use kebab_core::{SearchTrace, TraceCandidate, TraceFusionInput,
TraceTiming, ChunkId, DocumentId, WorkspacePath};
let r = kebab_app::SearchResponse {
hits: vec![],
next_cursor: None,
truncated: false,
trace: Some(SearchTrace {
lexical: vec![TraceCandidate {
chunk_id: ChunkId("c1".into()),
doc_id: DocumentId("d1".into()),
doc_path: WorkspacePath::new("a.md".into()).unwrap(),
rank: 1,
score: 0.42,
}],
vector: vec![],
rrf_inputs: vec![TraceFusionInput {
chunk_id: ChunkId("c1".into()),
lexical_rank: Some(1),
vector_rank: None,
fusion_score: 0.0,
}],
timing: TraceTiming { lexical_ms: 5, vector_ms: 0, fusion_ms: 1, total_ms: 7 },
}),
};
let v = wire_search_response(&r);
assert_eq!(schema_of(&v), Some("search_response.v1"));
assert!(v["trace"].is_object());
assert_eq!(v["trace"]["timing"]["lexical_ms"], 5);
assert_eq!(v["trace"]["lexical"][0]["chunk_id"], "c1");
}
#[test]
fn search_response_without_trace_omits_field() {
let r = kebab_app::SearchResponse {
hits: vec![],
next_cursor: None,
truncated: false,
trace: None,
};
let v = wire_search_response(&r);
assert!(v.get("trace").is_none(), "trace field absent when None");
}
}

View File

@@ -0,0 +1,105 @@
//! Integration: spawn kebab and verify --json mode emits error.v1 ndjson
//! on stderr while non-json mode emits the legacy `error:` text prefix.
//!
//! The `config_invalid` code is triggered by supplying an *existing* but
//! malformed TOML file via `--config`. Note: supplying a *non-existent*
//! path does NOT trigger this error — Config::load silently falls back to
//! defaults when the specified config file is absent (by design, so that
//! `kebab doctor` runs before `kebab init` is ever called). A file that
//! exists but fails TOML parsing is the reliable path to `config_invalid`.
use std::process::Command;
fn kebab_bin() -> std::path::PathBuf {
let manifest = env!("CARGO_MANIFEST_DIR");
std::path::PathBuf::from(manifest)
.parent()
.unwrap()
.parent()
.unwrap()
.join("target/debug/kebab")
}
fn xdg_envs(tmp: &std::path::Path) -> [(&'static str, std::path::PathBuf); 4] {
[
("XDG_CONFIG_HOME", tmp.join("cfg")),
("XDG_DATA_HOME", tmp.join("data")),
("XDG_CACHE_HOME", tmp.join("cache")),
("XDG_STATE_HOME", tmp.join("state")),
]
}
#[test]
fn json_mode_emits_error_v1_on_config_invalid() {
let tmp = tempfile::tempdir().unwrap();
// Write a file that exists but is not valid TOML.
let bad_config = tmp.path().join("bad.toml");
std::fs::write(&bad_config, b"this is not { valid toml !!!").unwrap();
let mut cmd = Command::new(kebab_bin());
cmd.args([
"--json",
"--config",
bad_config.to_str().unwrap(),
"ingest",
]);
for (k, v) in xdg_envs(tmp.path()) {
cmd.env(k, v);
}
let out = cmd.output().unwrap();
assert!(
!out.status.success(),
"expected non-zero exit for config_invalid"
);
let exit_code = out.status.code().unwrap_or(-1);
assert_eq!(exit_code, 2, "expected exit code 2, got {exit_code}");
let stderr = String::from_utf8(out.stderr).unwrap();
let first_line = stderr.lines().next().expect("stderr must have at least one line");
let v: serde_json::Value =
serde_json::from_str(first_line).expect("stderr first line must be valid JSON");
assert_eq!(
v.get("schema_version").and_then(|s| s.as_str()),
Some("error.v1"),
"schema_version must be error.v1"
);
assert_eq!(
v.get("code").and_then(|s| s.as_str()),
Some("config_invalid"),
"code must be config_invalid"
);
}
#[test]
fn text_mode_emits_legacy_error_format() {
let tmp = tempfile::tempdir().unwrap();
// Same trigger: an existing file with malformed TOML.
let bad_config = tmp.path().join("bad.toml");
std::fs::write(&bad_config, b"this is not { valid toml !!!").unwrap();
let mut cmd = Command::new(kebab_bin());
cmd.args(["--config", bad_config.to_str().unwrap(), "ingest"]);
for (k, v) in xdg_envs(tmp.path()) {
cmd.env(k, v);
}
let out = cmd.output().unwrap();
assert!(
!out.status.success(),
"expected non-zero exit for config_invalid"
);
let exit_code = out.status.code().unwrap_or(-1);
assert_eq!(exit_code, 2, "expected exit code 2, got {exit_code}");
let stderr = String::from_utf8(out.stderr).unwrap();
assert!(
stderr.starts_with("error:"),
"text mode stderr must start with 'error:', got: {stderr:?}"
);
assert!(
!stderr.trim_start().starts_with('{'),
"text mode stderr must NOT be JSON, got: {stderr:?}"
);
}

View File

@@ -0,0 +1,92 @@
//! Integration: spawn `kebab ingest-file <path>` and verify ingest_report.v1.
use std::fs;
use std::process::Command;
#[test]
fn cli_ingest_file_emits_ingest_report_v1() {
let dir = tempfile::tempdir().unwrap();
let workspace = dir.path().join("notes");
let data = dir.path().join("data");
fs::create_dir_all(&workspace).unwrap();
fs::create_dir_all(&data).unwrap();
let cfg_path = dir.path().join("config.toml");
fs::write(
&cfg_path,
format!(
r#"schema_version = 1
[workspace]
root = "{workspace}"
exclude = [".git/**"]
[storage]
data_dir = "{data}"
sqlite = "{{data_dir}}/kebab.sqlite"
vector_dir = "{{data_dir}}/lancedb"
asset_dir = "{{data_dir}}/assets"
artifact_dir = "{{data_dir}}/artifacts"
model_dir = "{{data_dir}}/models"
runs_dir = "{{data_dir}}/runs"
copy_threshold_mb = 100
[indexing]
max_parallel_extractors = 2
max_parallel_embeddings = 1
watch_filesystem = false
[chunking]
target_tokens = 500
overlap_tokens = 80
respect_markdown_headings = true
chunker_version = "md-heading-v1"
[models.embedding]
provider = "none"
model = "none"
version = "v0"
dimensions = 0
batch_size = 1
[models.llm]
provider = "ollama"
model = "none"
context_tokens = 4096
endpoint = "http://127.0.0.1:11434"
temperature = 0.0
seed = 0
[search]
default_k = 10
hybrid_fusion = "rrf"
rrf_k = 60
snippet_chars = 220
[rag]
prompt_template_version = "rag-v1"
score_gate = 0.30
explain_default = false
max_context_tokens = 8000
"#,
workspace = workspace.display(),
data = data.display(),
),
).unwrap();
let src = dir.path().join("doc.md");
fs::write(&src, "# A\n\nbody.").unwrap();
let bin = env!("CARGO_BIN_EXE_kebab");
let out = Command::new(bin)
.args(["--json", "--config", cfg_path.to_str().unwrap(), "ingest-file"])
.arg(&src)
.output()
.unwrap();
assert!(out.status.success(), "stderr: {}", String::from_utf8_lossy(&out.stderr));
let stdout = String::from_utf8_lossy(&out.stdout);
let v: serde_json::Value = serde_json::from_str(stdout.trim()).unwrap();
assert_eq!(v.get("schema_version").and_then(|s| s.as_str()), Some("ingest_report.v1"));
assert_eq!(v.get("new").and_then(|n| n.as_u64()), Some(1));
}

View File

@@ -0,0 +1,100 @@
//! Integration: spawn `kebab ingest-stdin --title X` with stdin pipe.
use std::fs;
use std::io::Write;
use std::process::{Command, Stdio};
#[test]
fn cli_ingest_stdin_emits_ingest_report_v1() {
let dir = tempfile::tempdir().unwrap();
let workspace = dir.path().join("notes");
let data = dir.path().join("data");
fs::create_dir_all(&workspace).unwrap();
fs::create_dir_all(&data).unwrap();
let cfg_path = dir.path().join("config.toml");
fs::write(
&cfg_path,
format!(
r#"schema_version = 1
[workspace]
root = "{workspace}"
exclude = [".git/**"]
[storage]
data_dir = "{data}"
sqlite = "{{data_dir}}/kebab.sqlite"
vector_dir = "{{data_dir}}/lancedb"
asset_dir = "{{data_dir}}/assets"
artifact_dir = "{{data_dir}}/artifacts"
model_dir = "{{data_dir}}/models"
runs_dir = "{{data_dir}}/runs"
copy_threshold_mb = 100
[indexing]
max_parallel_extractors = 2
max_parallel_embeddings = 1
watch_filesystem = false
[chunking]
target_tokens = 500
overlap_tokens = 80
respect_markdown_headings = true
chunker_version = "md-heading-v1"
[models.embedding]
provider = "none"
model = "none"
version = "v0"
dimensions = 0
batch_size = 1
[models.llm]
provider = "ollama"
model = "none"
context_tokens = 4096
endpoint = "http://127.0.0.1:11434"
temperature = 0.0
seed = 0
[search]
default_k = 10
hybrid_fusion = "rrf"
rrf_k = 60
snippet_chars = 220
[rag]
prompt_template_version = "rag-v1"
score_gate = 0.30
explain_default = false
max_context_tokens = 8000
"#,
workspace = workspace.display(),
data = data.display(),
),
).unwrap();
let bin = env!("CARGO_BIN_EXE_kebab");
let mut child = Command::new(bin)
.args([
"--json", "--config", cfg_path.to_str().unwrap(),
"ingest-stdin", "--title", "X",
])
.stdin(Stdio::piped())
.stdout(Stdio::piped())
.stderr(Stdio::piped())
.spawn()
.unwrap();
{
let stdin = child.stdin.as_mut().unwrap();
stdin.write_all(b"## Body\n\nbody text.\n").unwrap();
}
let out = child.wait_with_output().unwrap();
assert!(out.status.success(), "stderr: {}", String::from_utf8_lossy(&out.stderr));
let stdout = String::from_utf8_lossy(&out.stdout);
let v: serde_json::Value = serde_json::from_str(stdout.trim()).unwrap();
assert_eq!(v.get("schema_version").and_then(|s| s.as_str()), Some("ingest_report.v1"));
assert_eq!(v.get("new").and_then(|n| n.as_u64()), Some(1));
}

View File

@@ -0,0 +1,77 @@
//! Spawn `target/debug/kebab mcp` and exercise initialize → tools/list.
//!
//! rmcp 1.6 has no public in-memory test transport, so this is the only
//! end-to-end MCP assertion in the suite. The binary is located via
//! `CARGO_BIN_EXE_kebab` which cargo injects at test compile time.
use std::io::{BufRead, BufReader, Write};
use std::process::{Command, Stdio};
#[test]
fn cli_mcp_initialize_then_tools_list() {
let bin = env!("CARGO_BIN_EXE_kebab");
let mut child = Command::new(bin)
.arg("mcp")
.stdin(Stdio::piped())
.stdout(Stdio::piped())
.stderr(Stdio::null())
.spawn()
.unwrap();
let mut stdin = child.stdin.take().unwrap();
let stdout = child.stdout.take().unwrap();
let mut reader = BufReader::new(stdout);
// rmcp 1.6 defaults to protocol version "2025-03-26" (confirmed by
// manual smoke in Task 10). The server echoes whatever version the
// client sends during the handshake, so this literal must match.
let init_req = r#"{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"test","version":"0"}}}"#;
writeln!(stdin, "{init_req}").unwrap();
writeln!(
stdin,
r#"{{"jsonrpc":"2.0","method":"notifications/initialized"}}"#
)
.unwrap();
writeln!(
stdin,
r#"{{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{{}}}}"#
)
.unwrap();
// Read initialize response.
let mut line = String::new();
reader.read_line(&mut line).unwrap();
let init: serde_json::Value = serde_json::from_str(line.trim()).unwrap();
assert_eq!(
init.get("id").and_then(|i| i.as_i64()),
Some(1),
"unexpected id in initialize response: {init}"
);
assert!(
init.get("result").is_some(),
"initialize result missing: {init}"
);
// Read tools/list response.
line.clear();
reader.read_line(&mut line).unwrap();
let list: serde_json::Value = serde_json::from_str(line.trim()).unwrap();
assert_eq!(
list.get("id").and_then(|i| i.as_i64()),
Some(2),
"unexpected id in tools/list response: {list}"
);
let tools = list["result"]["tools"]
.as_array()
.expect("tools/list result.tools must be an array");
assert_eq!(
tools.len(),
8,
"expected 8 tools (schema, doctor, search, bulk_search, ask, fetch, ingest_file, ingest_stdin), got {}: {list}",
tools.len()
);
// Gracefully close stdin so the server shuts down cleanly.
drop(stdin);
let _ = child.wait().unwrap();
}

View File

@@ -0,0 +1,183 @@
//! Integration tests for `--readonly` and `--quiet` global flags (fb-28).
use std::io::Write;
use std::process::Command;
fn kebab_bin() -> std::path::PathBuf {
let manifest = env!("CARGO_MANIFEST_DIR");
std::path::PathBuf::from(manifest)
.parent()
.unwrap()
.parent()
.unwrap()
.join("target/debug/kebab")
}
fn fixture_workspace() -> (tempfile::TempDir, std::path::PathBuf) {
let tmp = tempfile::tempdir().unwrap();
let ws = tmp.path().join("workspace");
std::fs::create_dir_all(&ws).unwrap();
let mut a = std::fs::File::create(ws.join("a.md")).unwrap();
writeln!(a, "# Alpha\n\nfirst doc").unwrap();
(tmp, ws)
}
fn xdg_envs(tmp_path: &std::path::Path) -> [(&'static str, std::path::PathBuf); 4] {
[
("XDG_CONFIG_HOME", tmp_path.join("cfg")),
("XDG_DATA_HOME", tmp_path.join("data")),
("XDG_CACHE_HOME", tmp_path.join("cache")),
("XDG_STATE_HOME", tmp_path.join("state")),
]
}
#[test]
fn readonly_flag_blocks_ingest() {
let (tmp, ws) = fixture_workspace();
let out = Command::new(kebab_bin())
.args(["--readonly", "ingest", "--root", ws.to_str().unwrap()])
.envs(xdg_envs(tmp.path()))
.output()
.unwrap();
assert_eq!(out.status.code(), Some(1), "expected exit 1");
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(
stderr.contains("readonly mode"),
"expected 'readonly mode' in stderr, got: {stderr}"
);
}
#[test]
fn readonly_flag_blocks_ingest_file() {
let (tmp, ws) = fixture_workspace();
let file = ws.join("a.md");
let out = Command::new(kebab_bin())
.args(["--readonly", "ingest-file", file.to_str().unwrap()])
.envs(xdg_envs(tmp.path()))
.output()
.unwrap();
assert_eq!(out.status.code(), Some(1), "expected exit 1");
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(stderr.contains("readonly mode"), "stderr: {stderr}");
}
#[test]
fn readonly_flag_blocks_ingest_stdin() {
let (tmp, _ws) = fixture_workspace();
let out = Command::new(kebab_bin())
.args(["--readonly", "ingest-stdin", "--title", "test"])
.env("KEBAB_READONLY", "1")
.envs(xdg_envs(tmp.path()))
.stdin(std::process::Stdio::null())
.output()
.unwrap();
assert_eq!(out.status.code(), Some(1), "expected exit 1");
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(stderr.contains("readonly mode"), "stderr: {stderr}");
}
#[test]
fn readonly_flag_blocks_reset() {
let (tmp, _ws) = fixture_workspace();
let out = Command::new(kebab_bin())
.args(["--readonly", "reset", "--data-only", "--yes"])
.envs(xdg_envs(tmp.path()))
.output()
.unwrap();
assert_eq!(out.status.code(), Some(1), "expected exit 1");
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(stderr.contains("readonly mode"), "stderr: {stderr}");
}
#[test]
fn kebab_readonly_env_blocks_ingest() {
let (tmp, ws) = fixture_workspace();
let out = Command::new(kebab_bin())
.args(["ingest", "--root", ws.to_str().unwrap()])
.env("KEBAB_READONLY", "1")
.envs(xdg_envs(tmp.path()))
.output()
.unwrap();
assert_eq!(out.status.code(), Some(1), "expected exit 1");
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(stderr.contains("readonly mode"), "stderr: {stderr}");
}
#[test]
fn readonly_json_mode_emits_error_v1() {
let (tmp, ws) = fixture_workspace();
let out = Command::new(kebab_bin())
.args(["--readonly", "--json", "ingest", "--root", ws.to_str().unwrap()])
.envs(xdg_envs(tmp.path()))
.output()
.unwrap();
assert_eq!(out.status.code(), Some(1), "expected exit 1");
let stderr = String::from_utf8_lossy(&out.stderr);
let v: serde_json::Value = serde_json::from_str(stderr.trim())
.unwrap_or_else(|e| panic!("expected error.v1 JSON on stderr, got {stderr:?}: {e}"));
assert_eq!(
v.get("schema_version").and_then(|s| s.as_str()),
Some("error.v1"),
"expected schema_version=error.v1"
);
assert_eq!(
v.get("code").and_then(|s| s.as_str()),
Some("readonly_mode"),
"expected code=readonly_mode"
);
}
#[test]
fn quiet_flag_suppresses_progress_stderr() {
let (tmp, ws) = fixture_workspace();
let out = Command::new(kebab_bin())
.args(["--quiet", "ingest", "--root", ws.to_str().unwrap()])
.envs(xdg_envs(tmp.path()))
.output()
.unwrap();
assert!(
out.status.success(),
"exit: {:?}, stderr: {}",
out.status.code(),
String::from_utf8_lossy(&out.stderr)
);
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(
stderr.is_empty(),
"expected empty stderr with --quiet, got: {stderr}"
);
let stdout = String::from_utf8_lossy(&out.stdout);
assert!(
stdout.contains("scanned"),
"expected report summary on stdout, got: {stdout}"
);
}
#[test]
fn quiet_with_json_stdout_has_report_stderr_is_empty() {
let (tmp, ws) = fixture_workspace();
let out = Command::new(kebab_bin())
.args(["--quiet", "--json", "ingest", "--root", ws.to_str().unwrap()])
.envs(xdg_envs(tmp.path()))
.output()
.unwrap();
assert!(out.status.success(), "stderr: {}", String::from_utf8_lossy(&out.stderr));
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(stderr.is_empty(), "expected empty stderr, got: {stderr}");
let stdout = String::from_utf8_lossy(&out.stdout);
let last_line = stdout.lines().last().unwrap_or("");
let v: serde_json::Value = serde_json::from_str(last_line)
.unwrap_or_else(|e| panic!("expected JSON on stdout last line, got {last_line:?}: {e}"));
assert_eq!(
v.get("schema_version").and_then(|s| s.as_str()),
Some("ingest_report.v1")
);
}

View File

@@ -0,0 +1,134 @@
//! Integration: spawn the kebab binary and parse `kebab schema [--json]`.
//!
//! Each test builds an isolated TempDir-rooted XDG layout, runs
//! `kebab ingest` over an empty workspace (which creates and migrates
//! kebab.sqlite), then exercises `kebab schema` in JSON and text modes.
//! Using an empty workspace avoids the embedding model dependency while
//! still seeding the DB so `open_existing` inside schema_with_config
//! succeeds (a NotIndexed error fires when the DB file is absent).
use std::process::Command;
fn kebab_bin() -> std::path::PathBuf {
let manifest = env!("CARGO_MANIFEST_DIR");
std::path::PathBuf::from(manifest)
.parent()
.unwrap()
.parent()
.unwrap()
.join("target/debug/kebab")
}
fn xdg_envs(tmp: &std::path::Path) -> [(&'static str, std::path::PathBuf); 4] {
[
("XDG_CONFIG_HOME", tmp.join("cfg")),
("XDG_DATA_HOME", tmp.join("data")),
("XDG_CACHE_HOME", tmp.join("cache")),
("XDG_STATE_HOME", tmp.join("state")),
]
}
/// Seed kebab.sqlite by running `kebab ingest` over an empty workspace dir.
/// This is the minimum required for `kebab schema` to succeed: the store
/// uses `open_existing`, which errors when the DB file is absent.
fn seed_db(tmp: &tempfile::TempDir) {
let ws = tmp.path().join("ws");
std::fs::create_dir_all(&ws).unwrap();
let mut cmd = Command::new(kebab_bin());
cmd.args(["ingest", "--root", ws.to_str().unwrap(), "--summary-only"]);
for (k, v) in xdg_envs(tmp.path()) {
cmd.env(k, v);
}
let out = cmd.output().unwrap();
assert!(
out.status.success(),
"seed ingest failed: {}",
String::from_utf8_lossy(&out.stderr)
);
}
#[test]
fn cli_schema_json_emits_schema_v1() {
let tmp = tempfile::tempdir().unwrap();
seed_db(&tmp);
let mut cmd = Command::new(kebab_bin());
cmd.args(["--json", "schema"]);
for (k, v) in xdg_envs(tmp.path()) {
cmd.env(k, v);
}
let out = cmd.output().unwrap();
assert!(
out.status.success(),
"kebab --json schema failed: {}",
String::from_utf8_lossy(&out.stderr)
);
let stdout = String::from_utf8(out.stdout).unwrap();
let v: serde_json::Value = serde_json::from_str(&stdout).expect("stdout must be valid JSON");
assert_eq!(
v.get("schema_version").and_then(|s| s.as_str()),
Some("schema.v1"),
"schema_version must be schema.v1"
);
assert!(
v.get("kebab_version")
.and_then(|s| s.as_str())
.map(|s| !s.is_empty())
.unwrap_or(false),
"kebab_version must be a non-empty string"
);
let caps = v
.get("capabilities")
.and_then(|c| c.as_object())
.expect("capabilities must be a JSON object");
assert_eq!(
caps.get("json_mode").and_then(|b| b.as_bool()),
Some(true),
"capabilities.json_mode must be true"
);
assert_eq!(
caps.get("mcp_server").and_then(|b| b.as_bool()),
Some(true),
"capabilities.mcp_server must be true (fb-30)"
);
}
#[test]
fn cli_schema_text_mode_runs() {
let tmp = tempfile::tempdir().unwrap();
seed_db(&tmp);
let mut cmd = Command::new(kebab_bin());
cmd.args(["schema"]);
for (k, v) in xdg_envs(tmp.path()) {
cmd.env(k, v);
}
let out = cmd.output().unwrap();
assert!(
out.status.success(),
"kebab schema (text) failed: {}",
String::from_utf8_lossy(&out.stderr)
);
let stdout = String::from_utf8(out.stdout).unwrap();
assert!(
stdout.contains("kebab v"),
"text output must contain 'kebab v', got: {stdout}"
);
assert!(
stdout.contains("capabilities"),
"text output must contain 'capabilities', got: {stdout}"
);
assert!(
stdout.contains("models"),
"text output must contain 'models', got: {stdout}"
);
assert!(
stdout.contains("stats"),
"text output must contain 'stats', got: {stdout}"
);
}

View File

@@ -0,0 +1,243 @@
//! Shared CLI integration-test helpers.
//!
//! Each consumer (`tests/wire_search_stale.rs`, `tests/wire_ask_stale.rs`)
//! does `mod common;` and calls these via `common::write_config(...)`,
//! `common::ingest(...)`, `common::backdate_updated_at(...)`.
//!
//! `#![allow(dead_code)]` because each consumer typically uses only a
//! subset of the helpers; rustc would otherwise warn about the unused
//! ones in any single consumer's compilation.
#![allow(dead_code)]
use std::fs;
use std::path::{Path, PathBuf};
use std::process::Command;
/// Build a `config.toml` text under `dir`. `workspace_root` and
/// `data_dir` live inside `dir`. `stale_threshold_days` is plumbed
/// into `[search]` so the staleness post-process can fire.
///
/// Returns `(cfg_path, workspace_dir, data_dir)`.
pub fn write_config(dir: &Path, stale_threshold_days: u32) -> (PathBuf, PathBuf, PathBuf) {
write_config_with_llm_model(dir, stale_threshold_days, "none")
}
/// Like [`write_config`] but lets the caller pin a specific
/// `[models.llm].model` value — needed by `wire_ask_stale.rs` which
/// hits a real Ollama and wants `gemma4:e4b` instead of `none`.
pub fn write_config_with_llm_model(
dir: &Path,
stale_threshold_days: u32,
llm_model: &str,
) -> (PathBuf, PathBuf, PathBuf) {
let workspace = dir.join("workspace");
let data = dir.join("data");
fs::create_dir_all(&workspace).unwrap();
fs::create_dir_all(&data).unwrap();
let cfg_path = dir.join("config.toml");
fs::write(
&cfg_path,
format!(
r#"schema_version = 1
[workspace]
root = "{workspace}"
exclude = [".git/**"]
[storage]
data_dir = "{data}"
sqlite = "{{data_dir}}/kebab.sqlite"
vector_dir = "{{data_dir}}/lancedb"
asset_dir = "{{data_dir}}/assets"
artifact_dir = "{{data_dir}}/artifacts"
model_dir = "{{data_dir}}/models"
runs_dir = "{{data_dir}}/runs"
copy_threshold_mb = 100
[indexing]
max_parallel_extractors = 2
max_parallel_embeddings = 1
watch_filesystem = false
[chunking]
target_tokens = 80
overlap_tokens = 20
respect_markdown_headings = true
chunker_version = "md-heading-v1"
[models.embedding]
provider = "none"
model = "none"
version = "v0"
dimensions = 0
batch_size = 1
[models.llm]
provider = "ollama"
model = "{llm_model}"
context_tokens = 4096
endpoint = "http://127.0.0.1:11434"
temperature = 0.0
seed = 0
[search]
default_k = 10
hybrid_fusion = "rrf"
rrf_k = 60
snippet_chars = 220
stale_threshold_days = {stale_threshold_days}
[rag]
prompt_template_version = "rag-v1"
score_gate = 0.30
explain_default = false
max_context_tokens = 8000
"#,
workspace = workspace.display(),
data = data.display(),
llm_model = llm_model,
stale_threshold_days = stale_threshold_days,
),
)
.unwrap();
(cfg_path, workspace, data)
}
/// Run `kebab ingest --root <workspace>` against the given config.
/// Asserts success — failures abort the calling test.
pub fn ingest(cfg: &Path, workspace: &Path) {
let bin = env!("CARGO_BIN_EXE_kebab");
let out = Command::new(bin)
.args([
"--config",
cfg.to_str().unwrap(),
"ingest",
"--root",
workspace.to_str().unwrap(),
])
.output()
.unwrap();
assert!(
out.status.success(),
"ingest failed: stderr={}",
String::from_utf8_lossy(&out.stderr)
);
}
/// p9-fb-34: invoke `kebab search` with arbitrary trailing flags +
/// query, capture stdout + stderr. Caller is responsible for
/// supplying `--mode lexical` / `--json` etc. as needed; this helper
/// stays unopinionated so a single test can exercise both wire shapes
/// (JSON wrapper + plain stderr hint). Asserts the binary exited 0;
/// non-zero exits fail the test with stderr included.
pub fn run_search_with_args(cfg: &Path, args: &[&str]) -> (String, String) {
let bin = env!("CARGO_BIN_EXE_kebab");
let mut cmd = Command::new(bin);
cmd.arg("--config").arg(cfg).arg("search");
cmd.args(args);
let out = cmd.output().expect("kebab search");
assert!(
out.status.success(),
"search failed: args={args:?} stderr={}",
String::from_utf8_lossy(&out.stderr)
);
(
String::from_utf8_lossy(&out.stdout).to_string(),
String::from_utf8_lossy(&out.stderr).to_string(),
)
}
/// p9-fb-33: invoke `kebab ask --stream --mode lexical <query>` and
/// capture stdout + stderr. Lexical mode skips embeddings (matches
/// `wire_ask_stale.rs::run_ask_lexical`). Caller asserts on the
/// resulting (stdout, stderr) pair.
pub fn run_ask_stream(cfg: &Path, query: &str) -> (String, String) {
let bin = env!("CARGO_BIN_EXE_kebab");
let out = Command::new(bin)
.args([
"--config",
cfg.to_str().unwrap(),
"ask",
"--stream",
"--mode",
"lexical",
query,
])
.output()
.expect("kebab ask --stream");
(
String::from_utf8_lossy(&out.stdout).to_string(),
String::from_utf8_lossy(&out.stderr).to_string(),
)
}
/// p9-fb-33: invoke `kebab --json ask --mode lexical <query>` (no
/// `--stream`) — used by `wire_ask_stream::non_stream_path_unchanged`
/// to confirm the non-streaming JSON path still emits a single
/// `answer.v1` line on stdout. Returns stdout only (mirrors
/// `wire_ask_stale.rs::run_ask_lexical(json=true)` minus the
/// `Output` indirection).
pub fn run_ask_json(cfg: &Path, query: &str) -> String {
let bin = env!("CARGO_BIN_EXE_kebab");
let out = Command::new(bin)
.args([
"--config",
cfg.to_str().unwrap(),
"--json",
"ask",
"--mode",
"lexical",
query,
])
.output()
.expect("kebab ask --json");
String::from_utf8_lossy(&out.stdout).to_string()
}
/// p9-fb-35: invoke `kebab fetch` with arbitrary trailing flags,
/// capture stdout + stderr. Caller is responsible for supplying
/// `--json` (global flag) before the subcommand position via the
/// `args` slice (e.g. `&["--json", "chunk", &id]`). Asserts the
/// binary exited 0; non-zero exits fail the test with stderr
/// included — for negative-path tests (unknown chunk_id etc.) drive
/// the binary directly via `std::process::Command`.
pub fn run_fetch_with_args(cfg: &Path, args: &[&str]) -> (String, String) {
let bin = env!("CARGO_BIN_EXE_kebab");
let mut cmd = Command::new(bin);
cmd.arg("--config").arg(cfg).arg("fetch");
cmd.args(args);
let out = cmd.output().expect("kebab fetch");
assert!(
out.status.success(),
"fetch failed: args={args:?} stderr={}",
String::from_utf8_lossy(&out.stderr)
);
(
String::from_utf8_lossy(&out.stdout).to_string(),
String::from_utf8_lossy(&out.stderr).to_string(),
)
}
/// Rewrite `documents.updated_at` for one workspace path to
/// `now - days_ago` (RFC3339 UTC). Mirrors
/// `kebab-app/tests/common/mod.rs::backdate_document_updated_at`.
/// Asserts exactly one row is updated — typo-proofs the workspace path.
pub fn backdate_updated_at(data_dir: &Path, workspace_path: &str, days_ago: i64) {
let backdated = (time::OffsetDateTime::now_utc() - time::Duration::days(days_ago))
.format(&time::format_description::well_known::Rfc3339)
.expect("format backdated updated_at");
let db_path = data_dir.join("kebab.sqlite");
let conn = rusqlite::Connection::open(&db_path).expect("open kebab.sqlite");
let updated = conn
.execute(
"UPDATE documents SET updated_at = ?1 WHERE workspace_path = ?2",
rusqlite::params![backdated, workspace_path],
)
.expect("UPDATE documents.updated_at");
assert_eq!(
updated, 1,
"backdate_updated_at: expected to update exactly 1 row for {workspace_path}, got {updated}"
);
}

View File

@@ -162,3 +162,32 @@ fn ingest_json_progress_lines_carry_kind_and_ts() {
assert!(saw_scan_started, "missing scan_started event");
assert!(saw_completed, "missing completed event");
}
#[test]
fn kebab_progress_plain_env_emits_append_lines() {
// KEBAB_PROGRESS=plain forces non-TTY branch even in TTY-emulated envs.
// In subprocess tests there's no TTY anyway, so this primarily verifies
// the env var is accepted and the non-TTY path still works.
let (tmp, ws) = fixture_workspace();
let out = Command::new(kebab_bin())
.args(["ingest", "--root", ws.to_str().unwrap()])
.env("KEBAB_PROGRESS", "plain")
.envs(xdg_envs(tmp.path()))
.output()
.unwrap();
assert!(
out.status.success(),
"stderr: {}",
String::from_utf8_lossy(&out.stderr)
);
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(
stderr.contains("ingest: scanning"),
"expected 'ingest: scanning' in stderr, got: {stderr}"
);
assert!(
stderr.contains("ingest: complete"),
"expected 'ingest: complete' in stderr, got: {stderr}"
);
}

View File

@@ -0,0 +1,102 @@
//! p9-fb-32: CLI ask output — JSON path emits `indexed_at` + `stale`
//! on each citation; plain output prefixes stale citations with
//! `[stale]` (yellow on TTY).
//!
//! These end-to-end checks exercise `kebab ask`, which requires a real
//! Ollama on `127.0.0.1:11434` (same constraint as
//! `kebab-app/tests/ask_smoke.rs`). Both tests are therefore
//! `#[ignore]` by default — run with
//! `cargo test -p kebab-cli --test wire_ask_stale -- --ignored`
//! against a live Ollama.
//!
//! The `[stale]` rendering logic itself is also covered by a unit test
//! in `kebab-cli/src/main.rs` (`tests::plain_marks_stale_citation_*`)
//! that constructs a synthetic `Answer` and pipes it through
//! `render_ask_plain_citations` — that path is the always-on guard.
//!
//! Shared TempDir / ingest / backdate helpers live in
//! `tests/common/mod.rs`; see also `wire_search_stale.rs`.
mod common;
use std::fs;
use std::path::Path;
use std::process::Command;
/// Run `kebab ask` in lexical mode (no embedding required). `json`
/// toggles `--json`. The caller asserts on the resulting stdout.
fn run_ask_lexical(cfg: &Path, query: &str, json: bool) -> std::process::Output {
let bin = env!("CARGO_BIN_EXE_kebab");
let mut cmd = Command::new(bin);
cmd.arg("--config").arg(cfg);
if json {
cmd.arg("--json");
}
cmd.args(["ask", "--mode", "lexical", query]);
cmd.output().unwrap()
}
#[test]
#[ignore = "requires real Ollama on 127.0.0.1:11434"]
fn ask_json_citations_include_indexed_at_and_stale() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, data) = common::write_config_with_llm_model(dir.path(), 30, "gemma4:e4b");
fs::write(workspace.join("a.md"), "# T\n\napples are fruit\n").unwrap();
common::ingest(&cfg, &workspace);
common::backdate_updated_at(&data, "a.md", 60);
// ask returns exit 1 on refusal; the JSON envelope still goes to
// stdout. Don't assert on `status.success()` — accept either path
// and require the citations array to be present + structurally valid.
let out = run_ask_lexical(&cfg, "what about apples", true);
let stdout = String::from_utf8_lossy(&out.stdout);
let answer: serde_json::Value = serde_json::from_str(stdout.trim())
.unwrap_or_else(|e| panic!("expected JSON answer, got {stdout:?}: {e}"));
let cits = answer["citations"]
.as_array()
.unwrap_or_else(|| panic!("expected citations array, got {answer}"));
if let Some(cit) = cits.first() {
// Schema fields are always present on a structurally-valid
// AnswerCitation (serde-derived per Task 2 + Task 8).
assert!(
cit.get("indexed_at").is_some(),
"missing indexed_at on citation: {cit}"
);
assert!(
cit.get("stale").is_some(),
"missing stale on citation: {cit}"
);
assert_eq!(
cit["stale"], true,
"doc backdated 60d at threshold 30d must be stale: {cit}"
);
}
// If the model refused with zero citations the schema-shape claim
// is vacuously true; the unit-test path
// (`tests::plain_marks_stale_citation_*` in main.rs) is the
// always-on guard.
}
#[test]
#[ignore = "requires real Ollama on 127.0.0.1:11434"]
fn ask_plain_marks_stale_citation() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, data) = common::write_config_with_llm_model(dir.path(), 30, "gemma4:e4b");
fs::write(workspace.join("a.md"), "# T\n\napples are fruit\n").unwrap();
common::ingest(&cfg, &workspace);
common::backdate_updated_at(&data, "a.md", 60);
// Refusal exits 1 — that's still fine here, the renderer prints
// the citation block before the refusal exit when citations exist.
// If the model refused with zero citations, this test is
// best-effort (skip the assert): the unit-test path in main.rs
// (`tests::plain_marks_stale_citation_*`) is the always-on guard.
let out = run_ask_lexical(&cfg, "what about apples", false);
let stdout = String::from_utf8_lossy(&out.stdout);
if stdout.contains("근거:") {
assert!(
stdout.contains("[stale]"),
"stale tag missing in plain ask output:\n{stdout}"
);
}
}

View File

@@ -0,0 +1,241 @@
//! p9-fb-33: CLI streaming surface — stderr ndjson `answer_event.v1`
//! events while the answer streams; final stdout line is the existing
//! `answer.v1` (backwards compat with the non-`--stream` path).
//!
//! These end-to-end checks exercise `kebab ask --stream`, which
//! requires a real Ollama on `127.0.0.1:11434` (same constraint as
//! `wire_ask_stale.rs` + `kebab-app/tests/ask_smoke.rs`). All three
//! tests are therefore `#[ignore]` by default — run with
//! `cargo test -p kebab-cli --test wire_ask_stream -- --ignored`
//! against a live Ollama with `gemma4:e4b` pulled.
//!
//! The `BrokenPipe → cancel` test (Task 7 of the fb-33 plan) verifies
//! that closing the stderr reader propagates SendError through the
//! pipeline so the child terminates instead of hanging. That's the
//! main thing the integration test layer can prove that unit tests
//! can't — pipeline cancel is a cross-process concern.
//!
//! Shared TempDir / ingest helpers live in `tests/common/mod.rs`.
mod common;
use std::fs;
use std::path::Path;
use serde_json::Value;
/// Drop `[rag].score_gate` to ~0 in the test config so the
/// score-gate refusal path doesn't short-circuit the LLM call.
/// Lexical retrieval against a one-doc corpus produces tiny fusion
/// scores (well below the default 0.30 gate); the pipeline would
/// take the `refuse_score_gate` early-return — which does not emit
/// a `Final` event — making the streaming-event ordering assertion
/// vacuous. Lower the gate so the LLM actually runs.
fn relax_score_gate(cfg: &Path) {
let body = fs::read_to_string(cfg).expect("read config.toml");
let body = body.replace("score_gate = 0.30", "score_gate = 0.0");
fs::write(cfg, body).expect("write relaxed config.toml");
}
#[test]
#[ignore = "requires real Ollama on 127.0.0.1:11434"]
fn stream_emits_ndjson_events_on_stderr() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) =
common::write_config_with_llm_model(dir.path(), 30, "gemma4:e4b");
relax_score_gate(&cfg);
fs::write(
workspace.join("a.md"),
"# T\n\nrust ownership is a memory model.\n",
)
.unwrap();
common::ingest(&cfg, &workspace);
let (stdout, stderr) = common::run_ask_stream(&cfg, "ownership");
// stderr: every non-empty line should parse as JSON with
// schema_version == "answer_event.v1" and a recognized kind.
let mut kinds: Vec<String> = vec![];
for line in stderr.lines() {
if line.trim().is_empty() {
continue;
}
let v: Value = serde_json::from_str(line)
.unwrap_or_else(|e| panic!("non-JSON stderr line: {line:?}: {e}"));
assert_eq!(v["schema_version"], "answer_event.v1");
let kind = v["kind"].as_str().expect("kind").to_string();
assert!(
matches!(kind.as_str(), "retrieval_done" | "token" | "final"),
"unexpected kind: {kind}"
);
assert!(v["ts"].is_string(), "ts must be RFC3339 string");
kinds.push(kind);
}
// First event must be retrieval_done. Last must be final.
// Note: this test only exercises the LLM-running path which always
// closes with `final`. score-gate / no-chunks refusal paths emit
// only `retrieval_done` and skip `final` — that's why the test uses
// `relax_score_gate()` above to force the LLM path. See
// `stream_score_gate_refusal_emits_only_retrieval_done` for the
// refusal-path coverage.
assert_eq!(
kinds.first().map(String::as_str),
Some("retrieval_done"),
"first event must be retrieval_done, all kinds: {kinds:?}"
);
assert_eq!(
kinds.last().map(String::as_str),
Some("final"),
"last event must be final, all kinds: {kinds:?}"
);
// stdout: last line is answer.v1 (backwards compat with the
// non-streaming path — same wire shape, just emitted after the
// ndjson event stream rather than instead of it).
let final_line = stdout
.lines()
.last()
.expect("stdout has at least one line");
let answer: Value =
serde_json::from_str(final_line).expect("stdout final line = answer.v1");
assert_eq!(answer["schema_version"], "answer.v1");
}
#[test]
#[ignore = "requires real Ollama on 127.0.0.1:11434"]
fn non_stream_path_unchanged() {
// Verify that the non-streaming JSON path (no `--stream`) still
// emits a single `answer.v1` line on stdout — fb-33 must not
// perturb the existing wire surface.
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) =
common::write_config_with_llm_model(dir.path(), 30, "gemma4:e4b");
relax_score_gate(&cfg);
fs::write(
workspace.join("a.md"),
"# T\n\nrust ownership is a memory model.\n",
)
.unwrap();
common::ingest(&cfg, &workspace);
let stdout = common::run_ask_json(&cfg, "ownership");
let v: Value = serde_json::from_str(stdout.trim())
.unwrap_or_else(|e| panic!("expected answer.v1, got {stdout:?}: {e}"));
assert_eq!(v["schema_version"], "answer.v1");
}
// p9-fb-33 (Task 7): BrokenPipe → cancel propagation. Spawn the
// binary, read the first stderr line (retrieval_done), drop the
// reader. The pipeline's next `Token` send returns SendError, the
// cancel branch fires, child.wait() returns instead of blocking
// forever. The key invariant is *liveness* — that `wait()` returns
// in bounded time. Don't assert exit code: refusal is exit 1, but
// the child may also exit 0 if the LLM happened to finish before
// cancel propagated.
#[test]
#[ignore = "requires real Ollama on 127.0.0.1:11434 + writes to a closed pipe"]
fn stream_cancels_when_stderr_closes() {
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) =
common::write_config_with_llm_model(dir.path(), 30, "gemma4:e4b");
relax_score_gate(&cfg);
fs::write(
workspace.join("a.md"),
"# T\n\nrust ownership is a memory model. it tracks lifetimes.\n",
)
.unwrap();
common::ingest(&cfg, &workspace);
let bin = env!("CARGO_BIN_EXE_kebab");
let mut child = Command::new(bin)
.args([
"--config",
cfg.to_str().unwrap(),
"ask",
"--stream",
"--mode",
"lexical",
"ownership",
])
.stdout(Stdio::piped())
.stderr(Stdio::piped())
.spawn()
.expect("spawn kebab");
{
let stderr = child.stderr.take().expect("stderr piped");
let mut reader = BufReader::new(stderr);
let mut first = String::new();
reader
.read_line(&mut first)
.expect("read first stderr line");
assert!(
first.contains("\"kind\":\"retrieval_done\""),
"first event must be retrieval_done, got {first:?}"
);
// Drop the reader → child's stderr write end will see
// BrokenPipe on the next write → main thread drops rx →
// worker's pipeline.send returns SendError → cancel.
}
let status = child.wait().expect("child completes after cancel");
// Don't assert specific exit code — refusal is exit 1, but child
// may also exit 0 if the LLM finished before cancel propagated.
// The load-bearing assertion is that wait() returned at all.
let _ = status;
}
// p9-fb-33 (PR #124 round 1, item 4): score-gate refusal path —
// thin doc + unrelated query trips the default 0.30 score gate
// before the LLM runs. The pipeline emits only `retrieval_done`
// on stderr (no `token`, no `final`); stdout still carries the
// canonical `answer.v1` with `grounded=false`.
#[test]
#[ignore = "requires real Ollama on 127.0.0.1:11434"]
fn stream_score_gate_refusal_emits_only_retrieval_done() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) =
common::write_config_with_llm_model(dir.path(), 30, "gemma4:e4b");
// Intentionally NO relax_score_gate — keep the default 0.30
// so the thin-doc + unrelated-query combo trips refusal.
fs::write(
workspace.join("a.md"),
"# Title\n\nrust is a language.\n",
)
.unwrap();
common::ingest(&cfg, &workspace);
let (stdout, stderr) =
common::run_ask_stream(&cfg, "completely unrelated topic about cooking pasta");
let kinds: Vec<String> = stderr
.lines()
.filter(|l| !l.trim().is_empty())
.filter_map(|l| serde_json::from_str::<Value>(l).ok())
.filter_map(|v| v["kind"].as_str().map(String::from))
.collect();
// Refusal path: only retrieval_done, no token, no final.
assert!(
kinds.iter().all(|k| k == "retrieval_done"),
"refusal path must emit only retrieval_done, got {kinds:?}"
);
assert!(
!kinds.is_empty(),
"expected at least one retrieval_done event, got empty stderr"
);
// Stdout still has answer.v1 with grounded=false.
let final_line = stdout
.lines()
.last()
.expect("stdout has at least one line");
let answer: Value =
serde_json::from_str(final_line).expect("answer.v1");
assert_eq!(answer["schema_version"], "answer.v1");
assert_eq!(answer["grounded"], false);
}

View File

@@ -0,0 +1,174 @@
//! p9-fb-42: integration tests for `kebab search --bulk`.
//!
//! Lexical-only — no fastembed / no Ollama. Each test builds its own
//! TempDir KB via `common::write_config` + `common::ingest` and drives
//! `kebab search --bulk` through stdin. Verifies:
//!
//! - Two queries over stdin emit per-query ndjson `bulk_search_item.v1` lines.
//! - Empty stdin returns empty results with zero summary.
//! - Malformed ndjson exits with code 2 (config_invalid).
//! - Input over the 100-item cap fails with "max 100" error message.
//! - Invalid item field (e.g. bad `mode`) emits per-item error and continues.
mod common;
use serde_json::Value;
use std::fs;
use std::io::Write;
use std::process::{Command, Stdio};
fn cargo_bin() -> &'static str {
env!("CARGO_BIN_EXE_kebab")
}
fn run_bulk_with_stdin(cfg: &std::path::Path, stdin_body: &str, json: bool) -> std::process::Output {
let mut cmd = Command::new(cargo_bin());
cmd.arg("--config").arg(cfg).arg("search").arg("--bulk");
if json {
cmd.arg("--json");
}
cmd.stdin(Stdio::piped())
.stdout(Stdio::piped())
.stderr(Stdio::piped());
let mut child = cmd.spawn().expect("spawn kebab");
{
let mut sin = child.stdin.take().expect("stdin");
sin.write_all(stdin_body.as_bytes()).expect("write stdin");
}
child.wait_with_output().expect("wait")
}
fn seed_workspace(workspace: &std::path::Path) {
fs::write(workspace.join("a.md"), "# Alpha\n\nrust async hello").unwrap();
fs::write(workspace.join("b.md"), "# Bravo\n\nbread and kebab").unwrap();
}
// ---------------------------------------------------------------------------
// Test 1: Two queries over stdin emit per-query ndjson
// ---------------------------------------------------------------------------
#[test]
fn two_query_bulk_emits_per_query_ndjson() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 0);
seed_workspace(&workspace);
common::ingest(&cfg, &workspace);
let out = run_bulk_with_stdin(
&cfg,
"{\"query\":\"rust\",\"mode\":\"lexical\"}\n{\"query\":\"kebab\",\"mode\":\"lexical\"}\n",
true,
);
assert!(
out.status.success(),
"stderr: {}",
String::from_utf8_lossy(&out.stderr)
);
let stdout = String::from_utf8_lossy(&out.stdout);
let lines: Vec<&str> = stdout.lines().filter(|l| !l.trim().is_empty()).collect();
assert_eq!(lines.len(), 2, "expected 2 ndjson lines, got {lines:?}");
for line in &lines {
let v: Value = serde_json::from_str(line).expect("valid JSON line");
assert_eq!(v["schema_version"], "bulk_search_item.v1");
assert!(v["response"].is_object());
assert!(v["error"].is_null());
}
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(
stderr.contains("bulk_summary: total=2 succeeded=2 failed=0"),
"stderr summary missing: {stderr}"
);
}
// ---------------------------------------------------------------------------
// Test 2: Empty stdin returns empty results with zero summary
// ---------------------------------------------------------------------------
#[test]
fn empty_stdin_returns_empty_results_with_zero_summary() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 0);
seed_workspace(&workspace);
common::ingest(&cfg, &workspace);
let out = run_bulk_with_stdin(&cfg, "", true);
assert!(out.status.success());
let stdout = String::from_utf8_lossy(&out.stdout);
assert!(stdout.trim().is_empty(), "expected empty stdout, got: {stdout}");
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(stderr.contains("bulk_summary: total=0 succeeded=0 failed=0"));
}
// ---------------------------------------------------------------------------
// Test 3: Malformed ndjson line emits config_invalid exit 2
// ---------------------------------------------------------------------------
#[test]
fn malformed_ndjson_line_emits_config_invalid_exit_2() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 0);
seed_workspace(&workspace);
common::ingest(&cfg, &workspace);
let out = run_bulk_with_stdin(&cfg, "not json\n", true);
assert_eq!(out.status.code(), Some(2), "expected exit 2");
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(
stderr.contains("config_invalid") || stderr.contains("parse error"),
"expected config_invalid or parse error in stderr: {stderr}"
);
}
// ---------------------------------------------------------------------------
// Test 4: Over cap input (>100) emits error
// ---------------------------------------------------------------------------
#[test]
fn over_cap_input_emits_error() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 0);
seed_workspace(&workspace);
common::ingest(&cfg, &workspace);
let body: String = (0..101)
.map(|_| "{\"query\":\"x\",\"mode\":\"lexical\"}\n")
.collect();
let out = run_bulk_with_stdin(&cfg, &body, true);
// bulk_search_with_config returns Err — surfaces as exit 1 (anyhow chain)
// or 2 if classified by error_wire. Accept either, but message must mention `max 100`.
assert!(out.status.code().is_some());
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(
stderr.contains("max 100"),
"expected 'max 100' in stderr: {stderr}"
);
}
// ---------------------------------------------------------------------------
// Test 5: Invalid item field (bad mode) emits per-item error and continues
// ---------------------------------------------------------------------------
#[test]
fn invalid_item_field_emits_per_item_error_continues() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 0);
seed_workspace(&workspace);
common::ingest(&cfg, &workspace);
let out = run_bulk_with_stdin(
&cfg,
"{\"query\":\"rust\",\"mode\":\"lexical\"}\n{\"query\":\"x\",\"mode\":\"bogus\"}\n",
true,
);
assert!(out.status.success());
let stdout = String::from_utf8_lossy(&out.stdout);
let lines: Vec<&str> = stdout.lines().filter(|l| !l.trim().is_empty()).collect();
assert_eq!(lines.len(), 2);
let v0: Value = serde_json::from_str(lines[0]).unwrap();
let v1: Value = serde_json::from_str(lines[1]).unwrap();
assert!(v0["error"].is_null());
assert!(v1["error"].is_object());
assert_eq!(v1["error"]["code"], "invalid_input");
let stderr = String::from_utf8_lossy(&out.stderr);
assert!(stderr.contains("succeeded=1 failed=1"));
}

View File

@@ -0,0 +1,100 @@
//! p10-1A-1 Task 13: regression — the 5 original Citation variants
//! (Line, Page, Region, Caption, Time) serialize byte-identically to
//! pre-Task-1 form. No spurious `code`, `line_start`, or `symbol` keys
//! must leak into these variants.
use kebab_core::{Citation, WorkspacePath};
#[test]
fn line_variant_serialization_unchanged() {
let c = Citation::Line {
path: WorkspacePath::new("a.md".into()).unwrap(),
start: 1,
end: 2,
section: Some("§14".into()),
};
let v = serde_json::to_value(&c).unwrap();
assert_eq!(v["kind"], "line");
assert_eq!(v["start"], 1);
assert_eq!(v["end"], 2);
assert_eq!(v["section"], "§14");
// Must not bleed Code-variant keys.
assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
assert!(v.get("symbol").is_none(), "symbol must be absent: {v}");
assert!(v.get("code").is_none(), "code must be absent: {v}");
}
#[test]
fn line_variant_null_section_omitted() {
let c = Citation::Line {
path: WorkspacePath::new("b.md".into()).unwrap(),
start: 5,
end: 10,
section: None,
};
let v = serde_json::to_value(&c).unwrap();
assert_eq!(v["kind"], "line");
// `section` with None should be omitted (skip_serializing_if = is_none).
assert!(v.get("section").is_none() || v["section"].is_null());
}
#[test]
fn page_variant_serialization_unchanged() {
let c = Citation::Page {
path: WorkspacePath::new("a.pdf".into()).unwrap(),
page: 13,
section: None,
};
let v = serde_json::to_value(&c).unwrap();
assert_eq!(v["kind"], "page");
assert_eq!(v["page"], 13);
assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
assert!(v.get("symbol").is_none(), "symbol must be absent: {v}");
}
#[test]
fn region_variant_serialization_unchanged() {
let c = Citation::Region {
path: WorkspacePath::new("img.png".into()).unwrap(),
x: 10,
y: 20,
w: 100,
h: 200,
};
let v = serde_json::to_value(&c).unwrap();
assert_eq!(v["kind"], "region");
assert_eq!(v["x"], 10);
assert_eq!(v["y"], 20);
assert_eq!(v["w"], 100);
assert_eq!(v["h"], 200);
assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
}
#[test]
fn caption_variant_serialization_unchanged() {
let c = Citation::Caption {
path: WorkspacePath::new("a.png".into()).unwrap(),
model: "qwen2.5-vl:7b".into(),
};
let v = serde_json::to_value(&c).unwrap();
assert_eq!(v["kind"], "caption");
assert_eq!(v["model"], "qwen2.5-vl:7b");
assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
}
#[test]
fn time_variant_serialization_unchanged() {
let c = Citation::Time {
path: WorkspacePath::new("audio.mp3".into()).unwrap(),
start_ms: 1000,
end_ms: 5000,
speaker: Some("Alice".into()),
};
let v = serde_json::to_value(&c).unwrap();
assert_eq!(v["kind"], "time");
assert_eq!(v["start_ms"], 1000);
assert_eq!(v["end_ms"], 5000);
assert_eq!(v["speaker"], "Alice");
assert!(v.get("line_start").is_none(), "line_start must be absent: {v}");
assert!(v.get("symbol").is_none(), "symbol must be absent: {v}");
}

View File

@@ -0,0 +1,130 @@
//! p9-fb-35: CLI fetch wire shape + plain output + exit codes.
//!
//! Lexical-only — no fastembed / no Ollama. Each test builds its own
//! TempDir KB via `common::write_config` + `common::ingest` and drives
//! `kebab fetch` through `common::run_fetch_with_args`. Verifies:
//!
//! - `--json fetch chunk <id>` emits the `fetch_result.v1` wrapper
//! with `kind = "chunk"` and a populated `chunk` object.
//! - `--json fetch doc <id> --max-tokens N` flips `truncated: true`
//! once the budget binds.
//! - Unknown `chunk_id` exits non-zero and emits an `error.v1`
//! ndjson line on stderr with `code = "chunk_not_found"`.
mod common;
use serde_json::Value;
use std::fs;
#[test]
fn fetch_chunk_json_emits_fetch_result_v1() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
fs::write(workspace.join("a.md"), "# T\n\napples are red.\n").unwrap();
common::ingest(&cfg, &workspace);
// Find chunk_id via search.
let (search_stdout, _) = common::run_search_with_args(
&cfg,
&["--json", "--mode", "lexical", "--k", "1", "apples"],
);
let search: Value = serde_json::from_str(search_stdout.trim())
.unwrap_or_else(|e| panic!("search not JSON: {search_stdout:?}: {e}"));
let chunk_id = search["hits"][0]["chunk_id"]
.as_str()
.expect("chunk_id on first hit")
.to_string();
let (stdout, _) = common::run_fetch_with_args(
&cfg,
&["--json", "chunk", &chunk_id],
);
let v: Value = serde_json::from_str(stdout.trim())
.unwrap_or_else(|e| panic!("fetch not JSON: {stdout:?}: {e}"));
assert_eq!(v["schema_version"], "fetch_result.v1");
assert_eq!(v["kind"], "chunk");
assert!(
v["chunk"].is_object(),
"target chunk must be populated: {v}"
);
assert_eq!(v["truncated"], false);
}
#[test]
fn fetch_doc_json_with_max_tokens_truncates() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
let body: String = "Lorem ipsum dolor sit amet. ".repeat(20);
fs::write(workspace.join("big.md"), format!("# Big\n\n{body}\n")).unwrap();
common::ingest(&cfg, &workspace);
// Find doc_id via search.
let (search_stdout, _) = common::run_search_with_args(
&cfg,
&["--json", "--mode", "lexical", "--k", "1", "Lorem"],
);
let search: Value = serde_json::from_str(search_stdout.trim())
.unwrap_or_else(|e| panic!("search not JSON: {search_stdout:?}: {e}"));
let doc_id = search["hits"][0]["doc_id"]
.as_str()
.expect("doc_id on first hit")
.to_string();
let (stdout, _) = common::run_fetch_with_args(
&cfg,
&["--json", "doc", &doc_id, "--max-tokens", "20"],
);
let v: Value = serde_json::from_str(stdout.trim())
.unwrap_or_else(|e| panic!("fetch not JSON: {stdout:?}: {e}"));
assert_eq!(v["kind"], "doc");
assert_eq!(
v["truncated"], true,
"20-token cap must trip truncation: {v}"
);
}
#[test]
fn fetch_chunk_unknown_id_exits_with_error_v1() {
let dir = tempfile::tempdir().unwrap();
let (cfg, _workspace, _data) = common::write_config(dir.path(), 30);
// Direct invocation (not via the success-asserting helper) so we
// can read stderr on failure — mirrors the stale_cursor test in
// `wire_search_response.rs`.
let exe = env!("CARGO_BIN_EXE_kebab");
let cfg_str = cfg.to_str().expect("utf8");
let out = std::process::Command::new(exe)
.args([
"--config",
cfg_str,
"--json",
"fetch",
"chunk",
"nonexistent",
])
.output()
.expect("kebab fetch");
assert_ne!(out.status.code(), Some(0), "must exit non-zero");
let stderr = String::from_utf8_lossy(&out.stderr);
let err_line = stderr
.lines()
.find(|l| {
serde_json::from_str::<Value>(l)
.ok()
.and_then(|v| {
v.get("schema_version")
.and_then(|s| s.as_str())
.map(String::from)
})
.as_deref()
== Some("error.v1")
})
.unwrap_or_else(|| panic!("no error.v1 line on stderr: {stderr:?}"));
let v: Value = serde_json::from_str(err_line).expect("error.v1 json");
assert_eq!(
v["code"], "chunk_not_found",
"code must be chunk_not_found: {err_line}"
);
}

View File

@@ -0,0 +1,57 @@
//! p9-fb-37: integration tests for `kebab schema --json` extended stats.
mod common;
use serde_json::Value;
use std::fs;
use std::process::Command;
fn run_schema(cfg: &std::path::Path) -> Value {
let bin = env!("CARGO_BIN_EXE_kebab");
let out = Command::new(bin)
.args(["--config", cfg.to_str().unwrap(), "schema", "--json"])
.output()
.expect("run kebab schema");
assert!(
out.status.success(),
"schema failed: stderr={}",
String::from_utf8_lossy(&out.stderr)
);
serde_json::from_slice(&out.stdout).expect("valid JSON")
}
#[test]
fn schema_stats_includes_breakdowns_on_fresh_corpus() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 0);
// Run a no-op ingest to bring up migrations + create the SQLite file.
fs::write(workspace.join("placeholder.md"), "# placeholder\n").unwrap();
common::ingest(&cfg, &workspace);
let v = run_schema(&cfg);
let stats = &v["stats"];
let m = stats["media_breakdown"].as_object().unwrap();
assert_eq!(m.len(), 5, "5 media keys padded");
for k in &["markdown", "pdf", "image", "audio", "other"] {
assert!(m[*k].is_number(), "media[{k}] is integer");
}
assert!(stats["lang_breakdown"].is_object());
assert!(stats["index_bytes"]["sqlite"].is_number());
assert!(stats["index_bytes"]["lancedb"].is_number());
assert!(stats["stale_doc_count"].is_number());
}
#[test]
fn schema_stats_breakdowns_after_ingest() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 0);
fs::write(workspace.join("a.md"), "---\nlang: en\n---\nhello\n").unwrap();
fs::write(workspace.join("b.md"), "---\nlang: ko\n---\n안녕\n").unwrap();
common::ingest(&cfg, &workspace);
let v = run_schema(&cfg);
let stats = &v["stats"];
assert_eq!(stats["media_breakdown"]["markdown"], 2);
assert!(stats["lang_breakdown"].is_object());
assert!(stats["index_bytes"]["sqlite"].as_u64().unwrap() > 0);
}

View File

@@ -0,0 +1,306 @@
//! p9-fb-36: CLI integration tests for search filter flags.
//!
//! Lexical-only — no fastembed / no Ollama. Each test builds its own
//! TempDir KB via `common::write_config` + `common::ingest` and drives
//! `kebab search` through `common::run_search_with_args` or direct
//! `Command` invocations. Verifies:
//!
//! - `--doc-id <id>` restricts all returned hits to the target document.
//! - `--ingested-after <bad>` exits non-zero and emits `error.v1` on
//! stderr with `code = "config_invalid"`.
//! - `--media md` (alias) normalises to `markdown` and matches `.md` docs.
//! - `--tag <tag>` (repeatable, OR-within) filters by frontmatter tags.
mod common;
use serde_json::Value;
use std::fs;
use std::process::Command;
// ---------------------------------------------------------------------------
// Test 1: --doc-id restricts hits to a single document
// ---------------------------------------------------------------------------
#[test]
fn search_with_doc_id_filter_returns_only_target_doc() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
// Two docs that both contain the search term.
fs::write(workspace.join("a.md"), "# Alpha\n\nrust ownership rules\n").unwrap();
fs::write(workspace.join("b.md"), "# Beta\n\nrust borrow checker\n").unwrap();
common::ingest(&cfg, &workspace);
// First, search without a doc-id filter to find what doc_ids exist.
let (stdout, _) = common::run_search_with_args(
&cfg,
&["--json", "--mode", "lexical", "rust"],
);
let resp: Value = serde_json::from_str(stdout.trim())
.unwrap_or_else(|e| panic!("not JSON: {stdout:?}: {e}"));
let hits = resp["hits"].as_array().expect("hits array");
assert!(
hits.len() >= 2,
"expected ≥2 hits from two docs before filter: {resp}"
);
// Grab one doc_id from the results.
let target_doc_id = hits[0]["doc_id"]
.as_str()
.expect("doc_id string")
.to_string();
// Re-search with --doc-id set to the first hit's doc_id.
let (stdout2, _) = common::run_search_with_args(
&cfg,
&[
"--json",
"--mode",
"lexical",
"--doc-id",
&target_doc_id,
"rust",
],
);
let resp2: Value = serde_json::from_str(stdout2.trim())
.unwrap_or_else(|e| panic!("not JSON after filter: {stdout2:?}: {e}"));
let filtered_hits = resp2["hits"].as_array().expect("hits array (filtered)");
assert!(
!filtered_hits.is_empty(),
"expected at least one hit for the target doc"
);
for hit in filtered_hits {
let got = hit["doc_id"].as_str().expect("doc_id string in hit");
assert_eq!(
got, target_doc_id,
"--doc-id filter must restrict all hits to target doc, got {got}"
);
}
}
// ---------------------------------------------------------------------------
// Test 2: --ingested-after with bad RFC3339 → exit non-zero + error.v1
// ---------------------------------------------------------------------------
#[test]
fn search_with_invalid_ingested_after_emits_config_invalid() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
fs::write(workspace.join("a.md"), "# T\n\nrust stuff\n").unwrap();
common::ingest(&cfg, &workspace);
let bin = env!("CARGO_BIN_EXE_kebab");
let out = Command::new(bin)
.args([
"--config",
cfg.to_str().unwrap(),
"--json",
"search",
"--mode",
"lexical",
"--ingested-after",
"not-a-date",
"rust",
])
.output()
.expect("kebab search --ingested-after bad");
assert!(
!out.status.success(),
"expected non-zero exit for invalid --ingested-after, got: status={} stderr={}",
out.status,
String::from_utf8_lossy(&out.stderr)
);
let stderr = String::from_utf8_lossy(&out.stderr);
// Find the error.v1 ndjson line on stderr (one JSON event per line).
let err_line = stderr
.lines()
.find(|l| {
serde_json::from_str::<Value>(l)
.ok()
.and_then(|v| {
v.get("schema_version")
.and_then(|s| s.as_str())
.map(String::from)
})
.as_deref()
== Some("error.v1")
})
.unwrap_or_else(|| panic!("no error.v1 line on stderr: {stderr:?}"));
let v: Value = serde_json::from_str(err_line).expect("error.v1 json");
assert_eq!(
v["code"], "config_invalid",
"code must be config_invalid for bad RFC3339: {err_line}"
);
}
// ---------------------------------------------------------------------------
// Test 3: --media md (alias) normalises to markdown and matches .md docs
// ---------------------------------------------------------------------------
#[test]
fn search_with_media_filter_md_alias_normalizes_to_markdown() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
// Only a markdown file — the `md` alias should match it.
fs::write(workspace.join("notes.md"), "# Notes\n\nrust async programming\n").unwrap();
common::ingest(&cfg, &workspace);
let (stdout, _) = common::run_search_with_args(
&cfg,
&["--json", "--mode", "lexical", "--media", "md", "rust"],
);
let resp: Value = serde_json::from_str(stdout.trim())
.unwrap_or_else(|e| panic!("not JSON: {stdout:?}: {e}"));
let hits = resp["hits"].as_array().expect("hits array");
assert!(
!hits.is_empty(),
"--media md must match the markdown doc; got 0 hits: {resp}"
);
}
// ---------------------------------------------------------------------------
// Test 4: --tag (repeatable, OR-within) filters by frontmatter tags
// ---------------------------------------------------------------------------
#[test]
fn search_with_tag_filter_matches_frontmatter_tags() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
// Doc with `rust` tag.
fs::write(
workspace.join("rust_doc.md"),
"---\ntags: [rust, systems]\n---\n# Rust\n\nrust ownership\n",
)
.unwrap();
// Doc without the tag (but same keyword in body so it appears in
// unfiltered results — the tag filter must exclude it).
fs::write(
workspace.join("other_doc.md"),
"# Other\n\nrust programming\n",
)
.unwrap();
common::ingest(&cfg, &workspace);
// Without filter — both docs must produce hits.
let (unfiltered, _) = common::run_search_with_args(
&cfg,
&["--json", "--mode", "lexical", "rust"],
);
let uresp: Value = serde_json::from_str(unfiltered.trim())
.unwrap_or_else(|e| panic!("not JSON (unfiltered): {unfiltered:?}: {e}"));
let uhits = uresp["hits"].as_array().expect("unfiltered hits array");
assert!(
uhits.len() >= 2,
"expected ≥2 hits before tag filter: {uresp}"
);
// With --tag rust — only the tagged doc's hits should appear.
let (filtered, _) = common::run_search_with_args(
&cfg,
&["--json", "--mode", "lexical", "--tag", "rust", "rust"],
);
let fresp: Value = serde_json::from_str(filtered.trim())
.unwrap_or_else(|e| panic!("not JSON (tag-filtered): {filtered:?}: {e}"));
let fhits = fresp["hits"].as_array().expect("filtered hits array");
assert!(
!fhits.is_empty(),
"--tag rust must match the tagged doc; got 0 hits: {fresp}"
);
// Every returned hit must come from rust_doc.md (the tagged file).
for hit in fhits {
let path = hit["doc_path"].as_str().unwrap_or("");
assert!(
path.ends_with("rust_doc.md"),
"--tag rust must only return hits from the tagged doc, got path={path}"
);
}
}
// ---------------------------------------------------------------------------
// Test 5: --tag is repeatable (OR-within); two --tag values form an IN-list
// ---------------------------------------------------------------------------
#[test]
fn search_with_two_tag_filters_returns_or_within_tags() {
// Two docs with different tag sets:
// a.md → tags: [rust]
// b.md → tags: [async]
// c.md → no tags (but same keyword in body)
// Search with --tag rust --tag async (OR within --tag).
// Expect a.md and b.md, not c.md.
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
fs::write(
workspace.join("a.md"),
"---\ntags: [rust]\n---\n# A\n\nrust systems programming\n",
)
.unwrap();
fs::write(
workspace.join("b.md"),
"---\ntags: [async]\n---\n# B\n\nrust async programming\n",
)
.unwrap();
fs::write(workspace.join("c.md"), "# C\n\nrust programming\n").unwrap();
common::ingest(&cfg, &workspace);
// Without filter: all three docs produce hits.
let (unfiltered, _) = common::run_search_with_args(
&cfg,
&["--json", "--mode", "lexical", "rust"],
);
let uresp: Value = serde_json::from_str(unfiltered.trim())
.unwrap_or_else(|e| panic!("not JSON (unfiltered): {unfiltered:?}: {e}"));
let uhits = uresp["hits"].as_array().expect("unfiltered hits array");
assert!(
uhits.len() >= 3,
"expected ≥3 hits before tag filter: {uresp}"
);
// With --tag rust --tag async: only a.md and b.md should appear.
let (filtered, _) = common::run_search_with_args(
&cfg,
&[
"--json", "--mode", "lexical",
"--tag", "rust",
"--tag", "async",
"rust",
],
);
let fresp: Value = serde_json::from_str(filtered.trim())
.unwrap_or_else(|e| panic!("not JSON (two-tag-filtered): {filtered:?}: {e}"));
let fhits = fresp["hits"].as_array().expect("filtered hits array");
assert!(
!fhits.is_empty(),
"--tag rust --tag async must return hits from tagged docs; got 0: {fresp}"
);
// c.md must not appear — it has no tags.
for hit in fhits {
let path = hit["doc_path"].as_str().unwrap_or("");
assert!(
path.ends_with("a.md") || path.ends_with("b.md"),
"--tag rust --tag async must only return a.md or b.md, got path={path}"
);
}
// Both a.md and b.md must appear (OR, not AND).
let paths: Vec<&str> = fhits
.iter()
.filter_map(|h| h["doc_path"].as_str())
.collect();
let has_a = paths.iter().any(|p| p.ends_with("a.md"));
let has_b = paths.iter().any(|p| p.ends_with("b.md"));
assert!(has_a, "--tag rust must include a.md (rust-tagged): paths={paths:?}");
assert!(has_b, "--tag async must include b.md (async-tagged): paths={paths:?}");
}

View File

@@ -0,0 +1,72 @@
//! p10-1A-1 Task 15: CLI accepts --repo and --code-lang flags.
//!
//! These tests verify that clap parses the new flags without error.
//! They drive `kebab search --help` (which exercises flag parsing
//! via clap's help generation path, exiting 0) or use a minimal
//! config + `--json` round-trip to verify the flags reach the wire.
use std::process::Command;
fn kebab() -> Command {
Command::new(env!("CARGO_BIN_EXE_kebab"))
}
/// `kebab search --help` must exit 0 and mention `--repo`.
#[test]
fn cli_search_help_mentions_repo_flag() {
let out = kebab()
.args(["search", "--help"])
.output()
.expect("failed to run kebab");
// clap help exits 0.
assert!(
out.status.success(),
"kebab search --help exited non-zero: {:?}",
out.status
);
let stdout = String::from_utf8_lossy(&out.stdout);
assert!(
stdout.contains("--repo"),
"--repo flag must appear in search help output:\n{stdout}"
);
}
/// `kebab search --help` must exit 0 and mention `--code-lang`.
#[test]
fn cli_search_help_mentions_code_lang_flag() {
let out = kebab()
.args(["search", "--help"])
.output()
.expect("failed to run kebab");
assert!(
out.status.success(),
"kebab search --help exited non-zero: {:?}",
out.status
);
let stdout = String::from_utf8_lossy(&out.stdout);
assert!(
stdout.contains("--code-lang"),
"--code-lang flag must appear in search help output:\n{stdout}"
);
}
/// `kebab search --help` must exit 0 and mention `--media`.
/// Confirms `--media code` value pathway is available (media is
/// a free-form Vec<String> that already accepted arbitrary values).
#[test]
fn cli_search_help_mentions_media_flag() {
let out = kebab()
.args(["search", "--help"])
.output()
.expect("failed to run kebab");
assert!(
out.status.success(),
"kebab search --help exited non-zero: {:?}",
out.status
);
let stdout = String::from_utf8_lossy(&out.stdout);
assert!(
stdout.contains("--media"),
"--media flag must appear in search help output:\n{stdout}"
);
}

View File

@@ -0,0 +1,47 @@
//! p10-1A-1 Task 13: regression — markdown SearchHit omits `repo` and
//! `code_lang` from JSON when both are `None`.
//!
//! Proves that adding optional fields to SearchHit does not silently
//! inject spurious keys into the existing markdown corpus wire shape.
use kebab_core::{
Citation, ChunkId, ChunkerVersion, DocumentId, IndexVersion, RetrievalDetail, ScoreKind,
SearchHit, WorkspacePath,
};
#[test]
fn markdown_hit_omits_repo_and_code_lang() {
let hit = SearchHit {
rank: 1,
chunk_id: ChunkId("c1".into()),
doc_id: DocumentId("d1".into()),
doc_path: WorkspacePath::new("notes/foo.md".into()).unwrap(),
heading_path: vec!["A".into(), "B".into()],
section_label: Some("B".into()),
snippet: "hi".into(),
citation: Citation::Line {
path: WorkspacePath::new("notes/foo.md".into()).unwrap(),
start: 1,
end: 2,
section: None,
},
retrieval: RetrievalDetail::default(),
index_version: IndexVersion("v1".into()),
embedding_model: None,
chunker_version: ChunkerVersion("md-heading-v1".into()),
indexed_at: time::OffsetDateTime::UNIX_EPOCH,
stale: false,
score_kind: ScoreKind::Rrf,
repo: None,
code_lang: None,
};
let s = serde_json::to_string(&hit).unwrap();
assert!(
!s.contains("\"repo\""),
"repo should be absent from markdown hit JSON: {s}"
);
assert!(
!s.contains("\"code_lang\""),
"code_lang should be absent from markdown hit JSON: {s}"
);
}

View File

@@ -0,0 +1,226 @@
//! p9-fb-34: CLI search wire wrapper + budget controls.
//!
//! Lexical-only — no fastembed / no Ollama. Each test builds its own
//! TempDir KB via `common::write_config` + `common::ingest` and drives
//! `kebab search` through `common::run_search_with_args`. Verifies:
//!
//! - `--json` emits the `search_response.v1` wrapper (hits + cursor +
//! truncated).
//! - `--max-tokens` flips `truncated: true` once the budget binds.
//! - `--cursor` advances paging (page 2 chunk_ids disjoint from page 1).
//! - Plain (non-JSON) output prints the `[truncated; ...]` hint to
//! stderr (stdout stays the hit list).
mod common;
use serde_json::Value;
use std::fs;
#[test]
fn search_json_emits_search_response_v1_wrapper() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
fs::write(workspace.join("a.md"), "# T\n\napples are red.\n").unwrap();
common::ingest(&cfg, &workspace);
let (stdout, _stderr) = common::run_search_with_args(
&cfg,
&["--json", "--mode", "lexical", "apples"],
);
let v: Value = serde_json::from_str(stdout.trim())
.unwrap_or_else(|e| panic!("not JSON: {stdout:?}: {e}"));
assert_eq!(v["schema_version"], "search_response.v1");
assert!(v["hits"].is_array(), "hits must be array, got {v}");
assert!(
v["next_cursor"].is_null() || v["next_cursor"].is_string(),
"next_cursor must be null or string, got {}",
v["next_cursor"]
);
assert!(
v["truncated"].is_boolean(),
"truncated must be bool, got {}",
v["truncated"]
);
}
#[test]
fn search_json_truncates_with_max_tokens() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
let body: String = "rust ownership is a memory model. ".repeat(10);
fs::write(workspace.join("a.md"), format!("# T\n\n{body}\n")).unwrap();
common::ingest(&cfg, &workspace);
let (stdout, _stderr) = common::run_search_with_args(
&cfg,
&["--json", "--mode", "lexical", "--max-tokens", "30", "rust"],
);
let v: Value = serde_json::from_str(stdout.trim())
.unwrap_or_else(|e| panic!("not JSON: {stdout:?}: {e}"));
assert_eq!(
v["truncated"], true,
"30-token cap must trip truncation: {v}"
);
}
#[test]
fn search_json_cursor_paginates() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
for i in 0..6 {
fs::write(
workspace.join(format!("d{i}.md")),
format!("# T{i}\n\nrust topic {i}\n"),
)
.unwrap();
}
common::ingest(&cfg, &workspace);
let (page1, _) = common::run_search_with_args(
&cfg,
&["--json", "--mode", "lexical", "--k", "2", "rust"],
);
let v1: Value = serde_json::from_str(page1.trim())
.unwrap_or_else(|e| panic!("page1 not JSON: {page1:?}: {e}"));
let cursor = v1["next_cursor"]
.as_str()
.unwrap_or_else(|| panic!("next_cursor missing on page1: {v1}"));
let (page2, _) = common::run_search_with_args(
&cfg,
&[
"--json",
"--mode",
"lexical",
"--k",
"2",
"--cursor",
cursor,
"rust",
],
);
let v2: Value = serde_json::from_str(page2.trim())
.unwrap_or_else(|e| panic!("page2 not JSON: {page2:?}: {e}"));
let p1_ids: Vec<String> = v1["hits"]
.as_array()
.expect("page1 hits array")
.iter()
.map(|h| {
h["chunk_id"]
.as_str()
.expect("chunk_id string")
.to_string()
})
.collect();
let p2_ids: Vec<String> = v2["hits"]
.as_array()
.expect("page2 hits array")
.iter()
.map(|h| {
h["chunk_id"]
.as_str()
.expect("chunk_id string")
.to_string()
})
.collect();
assert!(
!p2_ids.is_empty(),
"page2 must return at least one hit (cursor advanced past page1)"
);
assert!(
p2_ids.iter().all(|id| !p1_ids.contains(id)),
"page2 must not repeat page1 chunk_ids: page1={p1_ids:?} page2={p2_ids:?}"
);
}
#[test]
fn search_stale_cursor_returns_error_v1_with_stale_cursor_code() {
// p9-fb-34 round-1 review: end-to-end wire contract — when the
// corpus_revision bumps between cursor issuance and the cursored
// search, `kebab --json search --cursor <stale>` must emit an
// `error.v1` ndjson line on stderr with `code = "stale_cursor"`.
// Pre-fix this returned `code = "generic"` because
// `App::search_with_opts` string-formatted the typed payload into
// anyhow, losing the structured wrapper.
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
fs::write(workspace.join("a.md"), "# T\n\napples\n").unwrap();
common::ingest(&cfg, &workspace);
// Get a valid cursor first.
let (page1_stdout, _) = common::run_search_with_args(
&cfg,
&["--mode", "lexical", "--json", "--k", "1", "apples"],
);
let v1: Value = serde_json::from_str(page1_stdout.trim()).expect("json");
let cursor = v1["next_cursor"]
.as_str()
.expect("k=1 page must emit next_cursor — fixture too small if this fails")
.to_string();
// Bump corpus_revision by ingesting a second doc.
fs::write(workspace.join("b.md"), "# B\n\nbananas\n").unwrap();
common::ingest(&cfg, &workspace);
// Use the now-stale cursor. Direct invocation (not via the
// success-asserting helper) so we can read stderr on failure.
let exe = env!("CARGO_BIN_EXE_kebab");
let cfg_str = cfg.to_str().expect("utf8");
let out = std::process::Command::new(exe)
.args([
"--config",
cfg_str,
"--json",
"search",
"--mode",
"lexical",
"--json",
"--cursor",
&cursor,
"apples",
])
.output()
.expect("kebab search --cursor");
let stderr = String::from_utf8_lossy(&out.stderr);
// Find the error.v1 ndjson line on stderr (one event per line).
let err_line = stderr
.lines()
.find(|l| {
serde_json::from_str::<Value>(l)
.ok()
.and_then(|v| {
v.get("schema_version")
.and_then(|s| s.as_str())
.map(String::from)
})
.as_deref()
== Some("error.v1")
})
.unwrap_or_else(|| panic!("no error.v1 line on stderr: {stderr:?}"));
let v: Value = serde_json::from_str(err_line).expect("error.v1 json");
assert_eq!(
v["code"], "stale_cursor",
"code must be stale_cursor: {err_line}"
);
}
#[test]
fn search_plain_emits_truncated_hint_to_stderr() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
let body: String = "rust ownership is a memory model. ".repeat(10);
fs::write(workspace.join("a.md"), format!("# T\n\n{body}\n")).unwrap();
common::ingest(&cfg, &workspace);
let (_stdout, stderr) = common::run_search_with_args(
&cfg,
&["--mode", "lexical", "--max-tokens", "30", "rust"],
);
assert!(
stderr.contains("[truncated;"),
"stderr must carry truncated hint: {stderr:?}"
);
}

View File

@@ -0,0 +1,50 @@
//! p9-fb-38: integration tests for `search_hit.v1.score_kind`.
mod common;
use serde_json::Value;
use std::fs;
fn doc_with_term(workspace: &std::path::Path) {
fs::write(workspace.join("doc1.md"), "# Title\n\nrust async hello\n").unwrap();
}
#[test]
fn lexical_mode_hits_carry_bm25_score_kind() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 0);
doc_with_term(&workspace);
common::ingest(&cfg, &workspace);
let (stdout, _stderr) = common::run_search_with_args(
&cfg,
&["--mode", "lexical", "--json", "rust"],
);
let v: Value = serde_json::from_str(stdout.trim()).expect("valid JSON");
let hits = v["hits"].as_array().expect("hits array");
assert!(!hits.is_empty(), "expected at least 1 hit");
for h in hits {
assert_eq!(h["score_kind"], "bm25");
}
}
#[test]
fn old_wire_reader_compat_score_kind_optional_field() {
// The wire schema marks `score_kind` as additive (not required).
// We can't easily simulate an old reader from inside Rust, but we
// can confirm the JSON includes the field — old readers that
// ignore unknown fields are unaffected. This test just ensures
// the field is always present in fb-38+ output.
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 0);
doc_with_term(&workspace);
common::ingest(&cfg, &workspace);
let (stdout, _stderr) = common::run_search_with_args(
&cfg,
&["--mode", "lexical", "--json", "rust"],
);
let v: Value = serde_json::from_str(stdout.trim()).unwrap();
let hit = &v["hits"][0];
assert!(hit.get("score_kind").is_some(), "score_kind always emitted");
}

View File

@@ -0,0 +1,106 @@
//! p9-fb-32: CLI emits `indexed_at` + `stale` on JSON; plain output
//! gains a `[stale]` tag prefix on stale hits.
//!
//! Self-contained: each test builds a TempDir workspace + config,
//! invokes the `kebab` binary via `CARGO_BIN_EXE_kebab`, and (for the
//! plain-output stale path) backdates `documents.updated_at` directly
//! via `rusqlite` to simulate an aged-out doc without faking system
//! time. Mirrors the helper pattern in
//! `crates/kebab-app/tests/common/mod.rs::backdate_document_updated_at`.
//!
//! Shared TempDir / ingest / backdate helpers live in
//! `tests/common/mod.rs`; see also `wire_ask_stale.rs`.
mod common;
use std::fs;
use std::path::Path;
use std::process::Command;
fn run_search_lexical(cfg: &Path, query: &str, json: bool) -> std::process::Output {
let bin = env!("CARGO_BIN_EXE_kebab");
let mut cmd = Command::new(bin);
cmd.arg("--config").arg(cfg);
if json {
cmd.arg("--json");
}
// Force lexical so the test doesn't need fastembed / AVX. Hybrid
// is the CLI default which would try the vector path.
cmd.args(["search", "--mode", "lexical", query]);
let out = cmd.output().unwrap();
assert!(
out.status.success(),
"search failed: stderr={}",
String::from_utf8_lossy(&out.stderr)
);
out
}
#[test]
fn search_json_includes_indexed_at_and_stale() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
fs::write(workspace.join("a.md"), "# Title\n\napples are fruit\n").unwrap();
common::ingest(&cfg, &workspace);
let out = run_search_lexical(&cfg, "apples", true);
let stdout = String::from_utf8_lossy(&out.stdout);
// p9-fb-34: top-level wire is now `search_response.v1` wrapping the
// legacy `search_hit.v1[]` under a `hits` field (with pagination +
// truncation metadata). Hit shape inside `hits` is unchanged.
let resp: serde_json::Value = serde_json::from_str(stdout.trim())
.unwrap_or_else(|e| panic!("expected JSON object, got {stdout:?}: {e}"));
assert_eq!(
resp.get("schema_version").and_then(|v| v.as_str()),
Some("search_response.v1"),
"expected search_response.v1 wrapper, got {resp}"
);
let arr = resp
.get("hits")
.and_then(|h| h.as_array())
.unwrap_or_else(|| panic!("expected hits array, got {stdout}"));
let first = arr.first().unwrap_or_else(|| panic!("expected ≥1 hit, got empty hits: {stdout}"));
assert!(
first.get("indexed_at").is_some(),
"missing indexed_at in {first}"
);
assert!(
first.get("stale").is_some(),
"missing stale in {first}"
);
assert_eq!(
first["stale"], false,
"freshly ingested doc must not be stale at default 30d threshold"
);
}
#[test]
fn search_plain_marks_stale_doc() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, data) = common::write_config(dir.path(), 30);
fs::write(workspace.join("a.md"), "# Title\n\napples are fruit\n").unwrap();
common::ingest(&cfg, &workspace);
common::backdate_updated_at(&data, "a.md", 60);
let out = run_search_lexical(&cfg, "apples", false);
let stdout = String::from_utf8_lossy(&out.stdout);
assert!(
stdout.contains("[stale]"),
"stale tag missing in plain output:\n{stdout}"
);
}
#[test]
fn search_plain_no_stale_tag_for_fresh_doc() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 30);
fs::write(workspace.join("a.md"), "# Title\n\napples are fruit\n").unwrap();
common::ingest(&cfg, &workspace);
let out = run_search_lexical(&cfg, "apples", false);
let stdout = String::from_utf8_lossy(&out.stdout);
assert!(
!stdout.contains("[stale]"),
"unexpected stale tag in plain output for fresh doc:\n{stdout}"
);
}

View File

@@ -0,0 +1,58 @@
//! p9-fb-37: integration tests for `kebab search --trace --json`.
mod common;
use serde_json::Value;
use std::fs;
#[test]
fn search_trace_json_includes_trace_block() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 0);
fs::write(workspace.join("doc1.md"), "# Title\n\nrust async hello\n").unwrap();
common::ingest(&cfg, &workspace);
let (stdout, _stderr) = common::run_search_with_args(
&cfg,
&["--mode", "lexical", "--trace", "--json", "rust"],
);
let v: Value = serde_json::from_str(stdout.trim()).expect("valid JSON");
assert_eq!(v["schema_version"], "search_response.v1");
assert!(v["trace"].is_object(), "trace block present");
assert!(v["trace"]["timing"].is_object());
assert!(v["trace"]["timing"]["total_ms"].is_number());
assert!(v["trace"]["lexical"].is_array());
assert!(v["trace"]["vector"].is_array());
assert!(v["trace"]["rrf_inputs"].is_array());
}
#[test]
fn search_without_trace_omits_trace_field() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 0);
fs::write(workspace.join("doc1.md"), "# Title\n\nrust async hello\n").unwrap();
common::ingest(&cfg, &workspace);
let (stdout, _stderr) = common::run_search_with_args(
&cfg,
&["--mode", "lexical", "--json", "rust"],
);
let v: Value = serde_json::from_str(stdout.trim()).expect("valid JSON");
assert!(v.get("trace").is_none(), "trace field absent without --trace");
}
#[test]
fn search_trace_lexical_mode_vector_list_empty() {
let dir = tempfile::tempdir().unwrap();
let (cfg, workspace, _data) = common::write_config(dir.path(), 0);
fs::write(workspace.join("doc1.md"), "# Title\n\nrust async hello\n").unwrap();
common::ingest(&cfg, &workspace);
let (stdout, _stderr) = common::run_search_with_args(
&cfg,
&["--mode", "lexical", "--trace", "--json", "rust"],
);
let v: Value = serde_json::from_str(stdout.trim()).expect("valid JSON");
assert_eq!(v["trace"]["vector"].as_array().unwrap().len(), 0);
assert_eq!(v["trace"]["timing"]["vector_ms"], 0);
}

View File

@@ -13,8 +13,12 @@ kebab-core = { path = "../kebab-core" }
anyhow = { workspace = true }
serde = { workspace = true }
serde_json = { workspace = true }
thiserror = { workspace = true }
toml = "0.8"
dirs = "5"
# p9-fb-05: warn-log when current_dir() fails (chroot, deleted cwd)
# during workspace.root resolution.
tracing = { workspace = true }
[dev-dependencies]
tempfile = { workspace = true }

View File

@@ -11,6 +11,19 @@ use serde::{Deserialize, Serialize};
mod paths;
pub use paths::{expand_path, expand_path_with_base};
/// Signal: `Config::from_file` / `Config::load` failed due to missing path,
/// I/O failure, TOML parse failure, or post-parse validation failure.
///
/// Wrapped into `anyhow::Error` at the API boundary so callers that need
/// structured details (e.g. kebab-cli's `error_classify`) can
/// `downcast_ref::<ConfigInvalid>()` for the wire record.
#[derive(Debug, thiserror::Error)]
#[error("config invalid at {path}: {cause}")]
pub struct ConfigInvalid {
pub path: PathBuf,
pub cause: String,
}
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct Config {
pub schema_version: u32,
@@ -32,6 +45,11 @@ pub struct Config {
/// `dark`).
#[serde(default = "UiCfg::defaults")]
pub ui: UiCfg,
/// p10-1A-1: code ingest settings. `#[serde(default)]` so existing
/// config files without an `[ingest]` / `[ingest.code]` section
/// load cleanly with built-in defaults.
#[serde(default)]
pub ingest: IngestCfg,
/// p9-fb-05: directory of the on-disk config file this `Config`
/// was loaded from, if any. Populated by `Config::from_file` /
/// `Config::load` — never serialized (`#[serde(skip)]`). Used by
@@ -51,7 +69,6 @@ pub struct Config {
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct WorkspaceCfg {
pub root: String,
pub include: Vec<String>,
pub exclude: Vec<String>,
}
@@ -119,12 +136,21 @@ pub struct SearchCfg {
/// (corpus_revision mismatch) are evicted on next access.
#[serde(default = "default_cache_capacity")]
pub cache_capacity: usize,
/// p9-fb-32: hits and citations whose source doc was last
/// re-processed more than this many days ago are marked
/// `stale: true` in wire / TUI / CLI surfaces. `0` disables.
#[serde(default = "default_stale_threshold_days")]
pub stale_threshold_days: u32,
}
fn default_cache_capacity() -> usize {
256
}
fn default_stale_threshold_days() -> u32 {
30
}
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct RagCfg {
pub prompt_template_version: String,
@@ -244,6 +270,52 @@ impl UiCfg {
}
}
/// p10-1A-1: top-level ingest configuration wrapper. Contains per-media-type
/// sub-sections; currently only `code` is defined.
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
#[serde(default)]
pub struct IngestCfg {
pub code: IngestCodeCfg,
}
/// p10-1A-1: settings for the code ingest pipeline. All fields have
/// reasonable defaults so the user need not set anything in `config.toml`
/// to get working code ingest.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
#[serde(default)]
pub struct IngestCodeCfg {
/// Generated header sniff. Reads first ~512 bytes, checks 7 markers.
pub skip_generated_header: bool,
/// Max byte size per file. Bigger files skipped.
pub max_file_bytes: u64,
/// Max line count per file. Bigger files skipped (byte cap checked first).
pub max_file_lines: u32,
/// User extra skip globs (gitignore syntax). Applied on top of built-in
/// + `.gitignore` + `.kebabignore`.
pub extra_skip_globs: Vec<String>,
/// AST chunk size cap. Functions/classes longer than this fall back to
/// paragraph-based split (1A-2 and later).
pub ast_chunk_max_lines: u32,
/// Tier 3 fallback chunker: lines per chunk.
pub fallback_lines_per_chunk: u32,
/// Tier 3 fallback chunker: line overlap between adjacent chunks.
pub fallback_lines_overlap: u32,
}
impl Default for IngestCodeCfg {
fn default() -> Self {
Self {
skip_generated_header: true,
max_file_bytes: 262_144,
max_file_lines: 5_000,
extra_skip_globs: vec![],
ast_chunk_max_lines: 200,
fallback_lines_per_chunk: 80,
fallback_lines_overlap: 20,
}
}
}
impl Config {
/// Defaults per design §6.4.
pub fn defaults() -> Self {
@@ -251,7 +323,6 @@ impl Config {
schema_version: 1,
workspace: WorkspaceCfg {
root: "~/KnowledgeBase".to_string(),
include: vec!["**/*.md".to_string()],
exclude: vec![
".git/**".to_string(),
"node_modules/**".to_string(),
@@ -282,9 +353,9 @@ impl Config {
models: ModelsCfg {
embedding: EmbeddingModelCfg {
provider: "fastembed".to_string(),
model: "multilingual-e5-small".to_string(),
model: "multilingual-e5-large".to_string(),
version: "v1".to_string(),
dimensions: 384,
dimensions: 1024,
batch_size: 64,
},
llm: LlmCfg {
@@ -306,15 +377,17 @@ impl Config {
rrf_k: 60,
snippet_chars: 220,
cache_capacity: default_cache_capacity(),
stale_threshold_days: 30,
},
rag: RagCfg {
prompt_template_version: "rag-v1".to_string(),
prompt_template_version: "rag-v2".to_string(),
score_gate: 0.30,
explain_default: false,
max_context_tokens: 8000,
},
image: ImageCfg::defaults(),
ui: UiCfg::defaults(),
ingest: IngestCfg::default(),
// p9-fb-05: defaults are not loaded from disk, so no
// source_dir. Relative `workspace.root` (rare with
// defaults) falls back to caller `cwd` via the
@@ -382,6 +455,25 @@ impl Config {
if p.exists() {
Self::from_file(&p)?
} else {
// macOS migration: if the new XDG path is absent but the
// old ~/Library/Application Support/kebab/config.toml exists,
// copy it to the new location so the user doesn't lose settings.
if let Some(legacy) = Self::macos_legacy_config_path() {
if legacy.exists() && !p.exists() {
if let Some(parent) = p.parent() {
let _ = std::fs::create_dir_all(parent);
}
if std::fs::copy(&legacy, &p).is_ok() {
eprintln!(
"kebab: migrated config {}{}",
legacy.display(),
p.display()
);
return Self::from_file(&p)
.map(|c| c.apply_env(&std::env::vars().collect()));
}
}
}
Self::defaults()
}
}
@@ -395,8 +487,41 @@ impl Config {
/// values resolve against the config file's directory rather
/// than the user's `cwd`.
pub fn from_file(path: &Path) -> anyhow::Result<Self> {
let text = std::fs::read_to_string(path)?;
let mut cfg: Self = toml::from_str(&text)?;
let text = std::fs::read_to_string(path).map_err(|e| {
anyhow::Error::new(ConfigInvalid {
path: path.to_path_buf(),
cause: format!("read_failed: {e}"),
})
})?;
// p9-fb-25: probe for the legacy `workspace.include` key — if
// present, emit a one-shot deprecation warning. Detection uses
// raw `toml::Value` lookup; the warning fires via a process-
// level OnceLock so a long-running TUI / CLI run doesn't spam
// the log on every Config::load.
if let Ok(value) = toml::from_str::<toml::Value>(&text) {
if value
.get("workspace")
.and_then(|v| v.get("include"))
.is_some()
{
static DEPRECATION_FIRED: std::sync::OnceLock<()> = std::sync::OnceLock::new();
DEPRECATION_FIRED.get_or_init(|| {
tracing::warn!(
target: "kebab-config",
config = %path.display(),
"deprecated config: `workspace.include` 필드는 더 이상 사용되지 않습니다 (p9-fb-25, v0.2.1+). 처리 가능한 형식 (md / png / jpg / pdf) 은 extractor 가 자동 결정. config 에서 이 필드를 제거해도 안전 — 더 이상 enforce 안 됨."
);
});
}
}
let mut cfg: Self = toml::from_str(&text).map_err(|e| {
anyhow::Error::new(ConfigInvalid {
path: path.to_path_buf(),
cause: format!("parse_failed: {e}"),
})
})?;
cfg.source_dir = path.parent().map(Path::to_path_buf);
Ok(cfg)
}
@@ -514,6 +639,11 @@ impl Config {
self.search.snippet_chars = n;
}
}
"KEBAB_SEARCH_STALE_THRESHOLD_DAYS" => {
if let Ok(n) = v.parse::<u32>() {
self.search.stale_threshold_days = n;
}
}
// rag
"KEBAB_RAG_PROMPT_TEMPLATE_VERSION" => {
@@ -590,8 +720,11 @@ impl Config {
return PathBuf::from(custom).join("kebab").join("config.toml");
}
}
match dirs::config_dir() {
Some(d) => d.join("kebab").join("config.toml"),
// Always use XDG-standard ~/.config regardless of platform.
// macOS dirs::config_dir() returns ~/Library/Application Support which
// collides with data_dir() — DataOnly reset would delete config too.
match dirs::home_dir() {
Some(h) => h.join(".config").join("kebab").join("config.toml"),
None => PathBuf::from("./kebab/config.toml"),
}
}
@@ -603,8 +736,9 @@ impl Config {
return PathBuf::from(custom).join("kebab");
}
}
match dirs::data_dir() {
Some(d) => d.join("kebab"),
// Always use XDG-standard ~/.local/share regardless of platform.
match dirs::home_dir() {
Some(h) => h.join(".local").join("share").join("kebab"),
None => PathBuf::from("./kebab-data"),
}
}
@@ -616,8 +750,9 @@ impl Config {
return PathBuf::from(custom).join("kebab");
}
}
match dirs::cache_dir() {
Some(d) => d.join("kebab"),
// Always use XDG-standard ~/.cache regardless of platform.
match dirs::home_dir() {
Some(h) => h.join(".cache").join("kebab"),
None => PathBuf::from("./kebab-cache"),
}
}
@@ -636,6 +771,25 @@ impl Config {
}
PathBuf::from("./kebab-state")
}
/// macOS legacy config path: `~/Library/Application Support/kebab/config.toml`.
/// Returns `None` on non-macOS or when home dir is unavailable.
/// Used for one-time migration to the XDG-standard location.
fn macos_legacy_config_path() -> Option<PathBuf> {
#[cfg(target_os = "macos")]
{
dirs::home_dir().map(|h| {
h.join("Library")
.join("Application Support")
.join("kebab")
.join("config.toml")
})
}
#[cfg(not(target_os = "macos"))]
{
None
}
}
}
/// Parse a permissive boolean — `1` / `true` / `yes` (case-insensitive)
@@ -662,10 +816,17 @@ mod tests {
let c = Config::defaults();
assert_eq!(c.rag.score_gate, 0.30);
assert_eq!(c.chunking.target_tokens, 500);
assert_eq!(c.models.embedding.dimensions, 384);
assert_eq!(c.models.embedding.model, "multilingual-e5-large");
assert_eq!(c.models.embedding.dimensions, 1024);
assert_eq!(c.search.rrf_k, 60);
}
#[test]
fn defaults_rag_prompt_template_version_is_rag_v2() {
let c = Config::defaults();
assert_eq!(c.rag.prompt_template_version, "rag-v2");
}
#[test]
fn env_override_score_gate() {
let mut env = HashMap::new();
@@ -839,9 +1000,9 @@ chunker_version = "md-heading-v1"
[models.embedding]
provider = "fastembed"
model = "multilingual-e5-small"
model = "multilingual-e5-large"
version = "v1"
dimensions = 384
dimensions = 1024
batch_size = 64
[models.llm]
@@ -857,9 +1018,10 @@ default_k = 10
hybrid_fusion = "rrf"
rrf_k = 60
snippet_chars = 220
stale_threshold_days = 30
[rag]
prompt_template_version = "rag-v1"
prompt_template_version = "rag-v2"
score_gate = 0.30
explain_default = false
max_context_tokens = 8000
@@ -868,6 +1030,70 @@ max_context_tokens = 8000
assert_eq!(c.image, ImageCfg::defaults());
}
/// p9-fb-25: legacy config with `workspace.include = [...]` must
/// still deserialize cleanly (silent unknown-field acceptance).
#[test]
fn legacy_include_field_is_ignored_silently() {
let mut cfg = Config::defaults();
cfg.workspace.root = "/tmp/kebab-legacy".to_string();
let mut toml_text = toml::to_string(&cfg).expect("default round-trips");
// Inject a legacy `include = [...]` line into the [workspace] block.
toml_text = toml_text.replace(
"[workspace]",
"[workspace]\ninclude = [\"**/*.md\", \"**/*.txt\"]",
);
let parsed: Result<Config, _> = toml::from_str(&toml_text);
assert!(parsed.is_ok(), "legacy include must not break load: {:?}", parsed.err());
let cfg = parsed.unwrap();
assert_eq!(cfg.workspace.root, "/tmp/kebab-legacy");
}
/// p9-fb-25: `WorkspaceCfg` must NOT have an `include` field.
/// Compile-time proof: exhaustive destructure.
#[test]
fn workspace_cfg_has_only_root_and_exclude_fields() {
let ws = Config::defaults().workspace;
let WorkspaceCfg { root: _, exclude: _ } = &ws;
}
#[test]
fn default_stale_threshold_is_30() {
let c = Config::defaults();
assert_eq!(c.search.stale_threshold_days, 30);
}
#[test]
fn env_override_stale_threshold() {
let c = Config::defaults();
let env: HashMap<String, String> = [
("KEBAB_SEARCH_STALE_THRESHOLD_DAYS".to_string(), "7".to_string()),
]
.into_iter()
.collect();
let c = c.apply_env(&env);
assert_eq!(c.search.stale_threshold_days, 7);
}
#[test]
fn env_negative_threshold_silently_ignored() {
// Env path: malformed numeric values (including negatives that
// can't fit `u32`) are silently ignored — same pattern as
// `KEBAB_SEARCH_DEFAULT_K`. The TOML file-load path (covered in
// `fb27_tests::file_negative_stale_threshold_returns_config_invalid`)
// is the spec-required hard error surface.
let c = Config::defaults();
let env: HashMap<String, String> = [
("KEBAB_SEARCH_STALE_THRESHOLD_DAYS".to_string(), "-5".to_string()),
]
.into_iter()
.collect();
let c = c.apply_env(&env);
assert_eq!(
c.search.stale_threshold_days, 30,
"env path: malformed value must leave the default unchanged"
);
}
#[test]
fn xdg_paths_honor_env() {
// Must restore env after the test to avoid polluting other tests.
@@ -886,4 +1112,109 @@ max_context_tokens = 8000
}
}
}
#[test]
fn ingest_code_cfg_defaults() {
let cfg: IngestCodeCfg = toml::from_str("").unwrap();
assert_eq!(cfg.max_file_bytes, 262_144);
assert_eq!(cfg.max_file_lines, 5_000);
assert!(cfg.skip_generated_header);
assert!(cfg.extra_skip_globs.is_empty());
assert_eq!(cfg.ast_chunk_max_lines, 200);
assert_eq!(cfg.fallback_lines_per_chunk, 80);
assert_eq!(cfg.fallback_lines_overlap, 20);
}
#[test]
fn ingest_code_cfg_user_override() {
let toml = r#"
max_file_bytes = 1048576
max_file_lines = 20000
skip_generated_header = false
extra_skip_globs = ["**/fixtures/**", "**/snapshots/**"]
"#;
let cfg: IngestCodeCfg = toml::from_str(toml).unwrap();
assert_eq!(cfg.max_file_bytes, 1_048_576);
assert_eq!(cfg.max_file_lines, 20_000);
assert!(!cfg.skip_generated_header);
assert_eq!(cfg.extra_skip_globs.len(), 2);
}
#[test]
fn config_with_ingest_code_section() {
// Build a full valid Config serialization and patch only the
// [ingest.code] field we care about — avoids having to enumerate
// every required Config field in the test fixture.
let base = Config::defaults();
let mut toml_text = toml::to_string(&base).unwrap();
// Inject max_file_bytes override into the [ingest.code] table.
toml_text = toml_text.replace(
"max_file_bytes = 262144",
"max_file_bytes = 524288",
);
let cfg: Config = toml::from_str(&toml_text).unwrap();
assert_eq!(cfg.ingest.code.max_file_bytes, 524_288);
}
}
#[cfg(test)]
mod fb27_tests {
use super::*;
use std::path::PathBuf;
#[test]
fn config_invalid_carries_path_and_cause() {
let nonexistent = PathBuf::from("/this/path/should/not/exist/kebab.toml");
let err = Config::from_file(&nonexistent).unwrap_err();
let signal = err.downcast_ref::<ConfigInvalid>()
.expect("from_file error should downcast to ConfigInvalid");
assert_eq!(signal.path, nonexistent);
assert!(!signal.cause.is_empty(), "cause should be non-empty");
}
#[test]
fn config_invalid_on_malformed_toml() {
let dir = tempfile::tempdir().unwrap();
let p = dir.path().join("bad.toml");
std::fs::write(&p, "this is not [valid toml").unwrap();
let err = Config::from_file(&p).unwrap_err();
let signal = err.downcast_ref::<ConfigInvalid>()
.expect("malformed TOML should downcast to ConfigInvalid");
assert_eq!(signal.path, p);
assert!(!signal.cause.is_empty(), "cause should be non-empty");
}
/// Spec §Config: a negative `stale_threshold_days` in TOML must be
/// rejected at load time (not silently coerced or ignored). serde's
/// `u32` type-check surfaces the failure as a parse error, which
/// `from_file` wraps into `ConfigInvalid`. CLI's `error_classify`
/// downcasts this and emits `error.v1.code = "config_invalid"`.
#[test]
fn file_negative_stale_threshold_returns_config_invalid() {
let dir = tempfile::tempdir().unwrap();
let p = dir.path().join("neg.toml");
// Build a minimally valid TOML and override only the field
// under test — this isolates the failure to the negative
// value rather than missing required sections.
let cfg = Config::defaults();
let mut toml_text = toml::to_string(&cfg).expect("default round-trips");
assert!(
toml_text.contains("stale_threshold_days = 30"),
"default value drifted; update test fixture"
);
toml_text = toml_text.replace(
"stale_threshold_days = 30",
"stale_threshold_days = -5",
);
std::fs::write(&p, &toml_text).unwrap();
let err = Config::from_file(&p).unwrap_err();
let signal = err.downcast_ref::<ConfigInvalid>()
.expect("negative stale_threshold_days should downcast to ConfigInvalid");
assert_eq!(signal.path, p);
assert!(
signal.cause.contains("parse_failed"),
"expected parse_failed cause, got: {}",
signal.cause
);
}
}

View File

@@ -35,6 +35,11 @@ pub struct Answer {
pub struct AnswerCitation {
pub marker: Option<String>,
pub citation: Citation,
/// p9-fb-32: cited doc's `documents.updated_at`.
#[serde(with = "time::serde::rfc3339")]
pub indexed_at: OffsetDateTime,
/// p9-fb-32: server-computed staleness flag per config threshold.
pub stale: bool,
}
/// p9-fb-15: history 가 prompt 에 들어갈 때의 한 turn. RAG facade 가
@@ -90,3 +95,29 @@ pub struct TokenUsage {
#[derive(Clone, Debug, Eq, Hash, PartialEq, Serialize, Deserialize)]
pub struct TraceId(pub String);
#[cfg(test)]
mod tests {
use super::*;
use crate::asset::WorkspacePath;
use crate::citation::Citation;
use time::macros::datetime;
#[test]
fn answer_citation_serializes_indexed_at_and_stale() {
let ac = AnswerCitation {
marker: Some("[1]".to_string()),
citation: Citation::Line {
path: WorkspacePath::new("a.md".to_string()).unwrap(),
start: 1,
end: 1,
section: None,
},
indexed_at: datetime!(2026-05-09 12:00:00 UTC),
stale: false,
};
let v = serde_json::to_value(&ac).unwrap();
assert_eq!(v["indexed_at"], "2026-05-09T12:00:00Z");
assert_eq!(v["stale"], false);
}
}

View File

@@ -37,6 +37,13 @@ pub enum Citation {
end_ms: u64,
speaker: Option<String>,
},
Code {
path: WorkspacePath,
line_start: u32,
line_end: u32,
symbol: Option<String>,
lang: Option<String>,
},
}
impl Citation {
@@ -46,7 +53,8 @@ impl Citation {
| Citation::Page { path, .. }
| Citation::Region { path, .. }
| Citation::Caption { path, .. }
| Citation::Time { path, .. } => path,
| Citation::Time { path, .. }
| Citation::Code { path, .. } => path,
}
}
@@ -80,6 +88,18 @@ impl Citation {
None => format!("{}#t={},{}", path.0, s, e),
}
}
Citation::Code {
path,
line_start,
line_end,
..
} => {
if line_start == line_end {
format!("{}#L{}", path.0, line_start)
} else {
format!("{}#L{}-L{}", path.0, line_start, line_end)
}
}
}
}
@@ -354,4 +374,64 @@ mod tests {
let r = Citation::parse("notes/x#evil.md#L7");
assert!(r.is_err(), "path with embedded '#' must be rejected");
}
#[test]
fn citation_code_variant_serializes_with_kind_tag() {
let c = Citation::Code {
path: WorkspacePath("crates/kebab-chunk/src/md_heading_v1.rs".into()),
line_start: 142,
line_end: 168,
symbol: Some("MdHeadingV1Chunker::chunk_doc".into()),
lang: Some("rust".into()),
};
let v = serde_json::to_value(&c).unwrap();
assert_eq!(v["kind"], "code");
assert_eq!(v["line_start"], 142);
assert_eq!(v["line_end"], 168);
assert_eq!(v["symbol"], "MdHeadingV1Chunker::chunk_doc");
assert_eq!(v["lang"], "rust");
// Existing 5 variants must NOT pick up these fields.
let line = Citation::Line {
path: WorkspacePath("notes/foo.md".into()),
start: 1,
end: 10,
section: None,
};
let lv = serde_json::to_value(&line).unwrap();
assert!(lv.get("line_start").is_none());
assert!(lv.get("symbol").is_none());
}
#[test]
fn citation_code_uri_format() {
let c = Citation::Code {
path: WorkspacePath("a/b.rs".into()),
line_start: 10,
line_end: 20,
symbol: None,
lang: Some("rust".into()),
};
assert_eq!(c.to_uri(), "a/b.rs#L10-L20");
// Single-line uses `#L10`.
let single = Citation::Code {
path: WorkspacePath("a/b.rs".into()),
line_start: 5,
line_end: 5,
symbol: None,
lang: None,
};
assert_eq!(single.to_uri(), "a/b.rs#L5");
}
#[test]
fn citation_code_path_accessor() {
let c = Citation::Code {
path: WorkspacePath("x.rs".into()),
line_start: 1,
line_end: 1,
symbol: None,
lang: None,
};
assert_eq!(c.path().0, "x.rs");
}
}

View File

@@ -142,6 +142,18 @@ pub enum SourceSpan {
start_ms: u64,
end_ms: u64,
},
/// p10-1A-2: AST-unit span for code ingest. Internal storage shape
/// (chunks.source_spans_json) — `citation_helper` maps this to the
/// wire `Citation::Code` (added 1A-1). `symbol` is the per-language
/// self-reference path (design §3.4); `<top-level>` / `<module>` for
/// glue regions, never null for an identified unit. `lang` is the
/// canonical code_lang.
Code {
line_start: u32,
line_end: u32,
symbol: Option<String>,
lang: Option<String>,
},
}
// ── Forward-declared stubs (§3.7a). Bodies are final per design. ────────
@@ -195,6 +207,24 @@ mod tests {
/// previously failed at serde runtime because `tag = "kind"` cannot
/// describe a newtype carrying a non-struct value. The struct-variant
/// shape used here is the §9 schema migration.
#[test]
fn source_span_code_round_trips_and_tags_lowercase() {
let s = SourceSpan::Code {
line_start: 10,
line_end: 42,
symbol: Some("foo::Bar::baz".to_string()),
lang: Some("rust".to_string()),
};
let v = serde_json::to_value(&s).unwrap();
assert_eq!(v["kind"], "code");
assert_eq!(v["line_start"], 10);
assert_eq!(v["line_end"], 42);
assert_eq!(v["symbol"], "foo::Bar::baz");
assert_eq!(v["lang"], "rust");
let back: SourceSpan = serde_json::from_value(v).unwrap();
assert_eq!(back, s);
}
#[test]
fn inline_serde_round_trip() {
let cases = vec![

View File

@@ -0,0 +1,87 @@
//! p9-fb-35 verbatim fetch domain types.
//!
//! Three modes (chunk / doc / span) carried by [`FetchQuery`]; one
//! response shape ([`FetchResult`]) discriminated by [`FetchKind`].
//! All types are `Serialize` so the CLI / MCP wire layers can hand
//! them straight through `serde_json::to_value`.
use serde::{Deserialize, Serialize};
use time::OffsetDateTime;
use crate::asset::WorkspacePath;
use crate::chunk::Chunk;
use crate::ids::{ChunkId, DocumentId};
#[derive(Clone, Debug)]
pub enum FetchQuery {
Chunk(ChunkId),
Doc(DocumentId),
Span {
doc_id: DocumentId,
line_start: u32,
line_end: u32,
},
}
#[derive(Clone, Debug, Default)]
pub struct FetchOpts {
/// chunk mode only: ±N chunks. None = no surrounding context.
pub context: Option<u32>,
/// doc / span mode only: chars/4 budget. None = no cap.
pub max_tokens: Option<usize>,
}
#[derive(Clone, Copy, Debug, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum FetchKind {
Chunk,
Doc,
Span,
}
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct FetchResult {
pub kind: FetchKind,
pub doc_id: DocumentId,
pub doc_path: WorkspacePath,
#[serde(with = "time::serde::rfc3339")]
pub indexed_at: OffsetDateTime,
pub stale: bool,
// chunk mode payloads
#[serde(skip_serializing_if = "Option::is_none")]
pub chunk: Option<Chunk>,
#[serde(skip_serializing_if = "Vec::is_empty", default)]
pub context_before: Vec<Chunk>,
#[serde(skip_serializing_if = "Vec::is_empty", default)]
pub context_after: Vec<Chunk>,
// doc / span payloads
#[serde(skip_serializing_if = "Option::is_none")]
pub text: Option<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub line_start: Option<u32>,
#[serde(skip_serializing_if = "Option::is_none")]
pub line_end: Option<u32>,
#[serde(skip_serializing_if = "Option::is_none")]
pub effective_end: Option<u32>,
pub truncated: bool,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn fetch_opts_default_is_all_none() {
let o = FetchOpts::default();
assert!(o.context.is_none());
assert!(o.max_tokens.is_none());
}
#[test]
fn fetch_kind_serializes_snake_case() {
let v = serde_json::to_value(FetchKind::Chunk).unwrap();
assert_eq!(v, serde_json::json!("chunk"));
let v = serde_json::to_value(FetchKind::Span).unwrap();
assert_eq!(v, serde_json::json!("span"));
}
}

View File

@@ -20,10 +20,51 @@ pub struct IngestReport {
pub unchanged: u32,
pub errors: u32,
pub duration_ms: u32,
/// p9-fb-25: per-extension skip count. Key = lowercase extension
/// without leading dot (e.g. "docx", "txt"); files without an
/// extension key under "<no-ext>". `BTreeMap` so the wire JSON
/// has stable key order across runs.
pub skipped_by_extension: std::collections::BTreeMap<String, u32>,
/// p10-1A-1: files skipped because they matched a repo-local `.gitignore`.
#[serde(default)]
pub skipped_gitignore: u32,
/// p10-1A-1: files skipped because they matched a `.kebabignore` entry.
#[serde(default)]
pub skipped_kebabignore: u32,
/// p10-1A-1: files skipped because they matched the built-in safety-net
/// blacklist (`node_modules/`, `target/`, `__pycache__/`, `.venv/`,
/// `venv/`, `env/`).
#[serde(default)]
pub skipped_builtin_blacklist: u32,
/// p10-1A-1: files skipped because their first ~512 bytes contained a
/// generated-file marker (`@generated`, `do not edit`, …).
#[serde(default)]
pub skipped_generated: u32,
/// p10-1A-1: files skipped because they exceeded `max_file_bytes` or
/// `max_file_lines` in `[ingest.code]`.
#[serde(default)]
pub skipped_size_exceeded: u32,
/// p10-1A-1: sample file paths per skip category (≤ 5 each).
#[serde(default)]
pub skip_examples: SkipExamples,
/// `None` ↔ wire `items: null` (`--summary-only`).
pub items: Option<Vec<IngestItem>>,
}
/// p10-1A-1: per-category sample of skipped file paths. Each category caps at
/// 5 entries (oldest-first). Used for debugging "why was X not indexed?"
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
pub struct SkipExamples {
#[serde(default)]
pub generated: Vec<String>,
#[serde(default)]
pub size_exceeded: Vec<String>,
#[serde(default)]
pub builtin_blacklist: Vec<String>,
#[serde(default)]
pub gitignore: Vec<String>,
}
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct IngestItem {
pub kind: IngestItemKind,
@@ -53,3 +94,55 @@ pub enum IngestItemKind {
Unchanged,
Error,
}
#[cfg(test)]
mod tests {
use super::*;
use crate::traits::SourceScope;
#[test]
fn skip_examples_default_is_empty() {
let s = SkipExamples::default();
assert!(s.generated.is_empty());
assert!(s.size_exceeded.is_empty());
assert!(s.builtin_blacklist.is_empty());
assert!(s.gitignore.is_empty());
}
#[test]
fn ingest_report_skip_counters_serialize() {
let r = IngestReport {
scope: SourceScope {
root: std::path::PathBuf::from("/tmp"),
include: vec![],
exclude: vec![],
},
scanned: 100,
new: 50,
updated: 0,
skipped: 0,
unchanged: 0,
errors: 0,
duration_ms: 1234,
skipped_by_extension: Default::default(),
skipped_gitignore: 30,
skipped_kebabignore: 5,
skipped_builtin_blacklist: 10,
skipped_generated: 3,
skipped_size_exceeded: 2,
skip_examples: SkipExamples {
generated: vec!["a/b.pb.rs".into()],
size_exceeded: vec![],
builtin_blacklist: vec!["node_modules/x.js".into()],
gitignore: vec![],
},
items: None,
};
let v = serde_json::to_value(&r).unwrap();
assert_eq!(v["skipped_gitignore"], 30);
assert_eq!(v["skipped_builtin_blacklist"], 10);
assert_eq!(v["skipped_generated"], 3);
assert_eq!(v["skipped_size_exceeded"], 2);
assert_eq!(v["skip_examples"]["generated"][0], "a/b.pb.rs");
}
}

View File

@@ -23,6 +23,7 @@ pub mod vector;
pub mod errors;
pub mod traits;
pub mod normalize;
pub mod fetch;
// Re-export the most commonly used items at the crate root, mirroring the
// public surface listed in the task spec.
@@ -50,14 +51,15 @@ pub use metadata::{
TrustLevel,
};
pub use search::{
DocFilter, DocSummary, RetrievalDetail, SearchFilters, SearchHit,
SearchMode, SearchQuery,
BulkSearchItem, BulkSearchResponse, BulkSearchSummary, DocFilter, DocSummary, IndexBytes, MEDIA_KINDS,
RetrievalDetail, ScoreKind, SearchFilters, SearchHit, SearchMode, SearchOpts, SearchQuery, SearchTrace,
TraceCandidate, TraceFusionInput, TraceTiming,
};
pub use answer::{
Answer, AnswerCitation, AnswerRetrievalSummary, ModelRef, RefusalReason, TokenUsage,
TraceId, Turn,
};
pub use ingest::{IngestItem, IngestItemKind, IngestReport};
pub use ingest::{IngestItem, IngestItemKind, IngestReport, SkipExamples};
pub use jobs::{JobFilter, JobId, JobKind, JobRow, JobStatus};
pub use vector::{VectorHit, VectorRecord};
pub use errors::CoreError;
@@ -68,3 +70,4 @@ pub use traits::{
SourceScope, TokenChunk, VectorStore,
};
pub use normalize::{nfc, to_posix};
pub use fetch::{FetchKind, FetchOpts, FetchQuery, FetchResult};

View File

@@ -40,5 +40,23 @@ pub enum MediaType {
Pdf,
Image(ImageType),
Audio(AudioType),
/// p10-1A-2: a source-code file. Inner string is the canonical
/// code_lang (design §3.5). 1A activates `"rust"` only; other
/// recognized code langs are still routed `Other` until their phase.
Code(String),
Other(String),
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn media_type_code_serializes_lowercase_tagged() {
let m = MediaType::Code("rust".to_string());
let v = serde_json::to_value(&m).unwrap();
assert_eq!(v, serde_json::json!({ "code": "rust" }));
let back: MediaType = serde_json::from_value(v).unwrap();
assert_eq!(back, m);
}
}

View File

@@ -17,6 +17,25 @@ pub struct Metadata {
pub user_id_alias: Option<String>,
/// Frontmatter keys we don't recognise are preserved here per §0 Q9.
pub user: Map<String, Value>,
/// p10-1A-1: name of the source repo if the file lives inside a git
/// working tree (`.git/` walk-up). null otherwise.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub repo: Option<String>,
/// p10-1A-1: HEAD branch at ingest time. null when no repo or detached HEAD.
/// Informational only — current-state observability, not a partition key.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub git_branch: Option<String>,
/// p10-1A-1: HEAD commit (40-hex) at ingest time. null when no repo.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub git_commit: Option<String>,
/// p10-1A-1: programming language identifier (lowercase canonical). null
/// for markdown / pdf / image. Set by `kebab_parse_code::lang::code_lang_for_path`.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub code_lang: Option<String>,
}
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq, Serialize, Deserialize)]
@@ -66,3 +85,54 @@ pub enum ProvenanceKind {
Warning,
Error,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn metadata_repo_fields_default_to_none_and_omit_when_serialized() {
let m = Metadata {
aliases: vec![],
tags: vec![],
created_at: time::OffsetDateTime::UNIX_EPOCH,
updated_at: time::OffsetDateTime::UNIX_EPOCH,
source_type: SourceType::Markdown,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: None,
git_branch: None,
git_commit: None,
code_lang: None,
};
let v = serde_json::to_value(&m).unwrap();
assert!(v.get("repo").is_none());
assert!(v.get("git_branch").is_none());
assert!(v.get("git_commit").is_none());
assert!(v.get("code_lang").is_none());
}
#[test]
fn metadata_repo_fields_present_when_some() {
let m = Metadata {
aliases: vec![],
tags: vec![],
created_at: time::OffsetDateTime::UNIX_EPOCH,
updated_at: time::OffsetDateTime::UNIX_EPOCH,
source_type: SourceType::Markdown,
trust_level: TrustLevel::Primary,
user_id_alias: None,
user: Default::default(),
repo: Some("kebab".into()),
git_branch: Some("main".into()),
git_commit: Some("a".repeat(40)),
code_lang: Some("rust".into()),
};
let v = serde_json::to_value(&m).unwrap();
assert_eq!(v["repo"], "kebab");
assert_eq!(v["git_branch"], "main");
assert_eq!(v["git_commit"].as_str().unwrap().len(), 40);
assert_eq!(v["code_lang"], "rust");
}
}

View File

@@ -26,12 +26,49 @@ pub struct SearchQuery {
pub filters: SearchFilters,
}
/// p9-fb-36: canonical kind labels for `SearchFilters.media`. Mirrors
/// `MediaType` variant tags; CLI / MCP normalize aliases (`md` → `markdown`)
/// before populating this Vec.
pub const MEDIA_KINDS: &[&str] = &["markdown", "pdf", "image", "audio", "other"];
/// p9-fb-38: top-level `SearchHit.score` declaration.
/// `Rrf` (hybrid) / `Bm25` (lexical-only) / `Cosine` (vector-only).
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum ScoreKind {
#[default]
Rrf,
Bm25,
Cosine,
}
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
pub struct SearchFilters {
pub tags_any: Vec<String>,
pub lang: Option<Lang>,
pub path_glob: Option<String>,
pub trust_min: Option<TrustLevel>,
/// p9-fb-36: media_type filter — IN-list of `MediaType.kind`
/// strings (`"markdown"`, `"pdf"`, `"image"`, `"audio"`, `"other"`).
/// Empty Vec = no filter. Match is on the variant tag only;
/// e.g. `["image"]` matches `Image(Png)` and `Image(Jpeg)`.
#[serde(default)]
pub media: Vec<String>,
/// p9-fb-36: hits whose source doc's `documents.updated_at` is at
/// or after this timestamp. None = no filter. RFC3339 / UTC.
#[serde(default, with = "time::serde::rfc3339::option")]
pub ingested_after: Option<OffsetDateTime>,
/// p9-fb-36: restrict hits to a single document. None = no filter.
#[serde(default)]
pub doc_id: Option<DocumentId>,
/// p10-1A-1: filter by `metadata.repo`. Empty = no filter; multi-value = OR.
#[serde(default)]
pub repo: Vec<String>,
/// p10-1A-1: filter by `metadata.code_lang`. Empty = no filter; multi-value = OR.
/// Identifiers are lowercase canonical names (`rust`, `python`, `typescript`, ...).
/// Unknown values produce empty hits (consistent with `media` policy).
#[serde(default)]
pub code_lang: Vec<String>,
}
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
@@ -48,6 +85,27 @@ pub struct SearchHit {
pub index_version: IndexVersion,
pub embedding_model: Option<EmbeddingModelId>,
pub chunker_version: ChunkerVersion,
/// p9-fb-32: source doc's `documents.updated_at` (last actual re-process).
/// fb-23 incremental ingest skip path leaves this unchanged.
#[serde(with = "time::serde::rfc3339")]
pub indexed_at: OffsetDateTime,
/// p9-fb-32: server-computed `now - indexed_at > threshold` per
/// `config.search.stale_threshold_days`. `false` when threshold = 0.
pub stale: bool,
/// p9-fb-38: declares the meaning of the top-level `score`.
/// `Rrf` (hybrid mode), `Bm25` (lexical-only), `Cosine` (vector-only).
/// 옛 wire (fb-38 미만) 부재 시 `Rrf` default — hybrid 가 기본 mode.
#[serde(default)]
pub score_kind: ScoreKind,
/// p10-1A-1: optional. Filled when the source file lives in a git repo
/// (`.git/` walk-up). null for markdown / pdf / image hits and for code
/// hits ingested via `kebab ingest-file` outside a repo boundary.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub repo: Option<String>,
/// p10-1A-1: optional. Programming language identifier (lowercase). Set for
/// every code/manifest/k8s chunk; null for markdown / pdf / image hits.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub code_lang: Option<String>,
}
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
@@ -60,6 +118,19 @@ pub struct RetrievalDetail {
pub vector_rank: Option<u32>,
}
impl Default for RetrievalDetail {
fn default() -> Self {
Self {
method: SearchMode::Hybrid,
fusion_score: 0.0,
lexical_score: None,
vector_score: None,
lexical_rank: None,
vector_rank: None,
}
}
}
/// Filter for `kb-app::list_docs` (§7.2 DocumentStore::list_documents).
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
pub struct DocFilter {
@@ -88,3 +159,376 @@ pub struct DocSummary {
pub parser_version: ParserVersion,
pub chunker_version: ChunkerVersion,
}
/// p9-fb-34: caller-supplied output budget knobs for `App::search_with_opts`.
/// All `None` = no enforcement (existing behavior).
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
pub struct SearchOpts {
/// chars/4 approximation of wire JSON token cost. None = no cap.
pub max_tokens: Option<usize>,
/// Per-hit snippet character cap. None = use config default.
pub snippet_chars: Option<usize>,
/// Opaque base64 cursor from a previous response. None = first page.
pub cursor: Option<String>,
/// p9-fb-37: when true, capture pipeline trace (cache bypassed,
/// lex / vec pre-fusion lists + timing populated on the response).
#[serde(default)]
pub trace: bool,
}
/// p9-fb-37: search retrieval pipeline trace. Populated only when
/// `SearchOpts.trace = true`; `None` on the wrapping `SearchResponse`
/// otherwise. `lexical` / `vector` are pre-fusion candidate lists
/// (each retriever's full output for the fanout query). `rrf_inputs`
/// is the union (chunk_id) used by RRF, with each side's rank
/// captured. `timing` is wall-clock per stage.
#[derive(Clone, Debug, Default, PartialEq, Serialize, Deserialize)]
pub struct SearchTrace {
pub lexical: Vec<TraceCandidate>,
pub vector: Vec<TraceCandidate>,
pub rrf_inputs: Vec<TraceFusionInput>,
pub timing: TraceTiming,
}
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct TraceCandidate {
pub chunk_id: ChunkId,
pub doc_id: DocumentId,
pub doc_path: WorkspacePath,
pub rank: u32,
pub score: f32,
}
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct TraceFusionInput {
pub chunk_id: ChunkId,
pub lexical_rank: Option<u32>,
pub vector_rank: Option<u32>,
/// Hybrid mode: normalized RRF score in `[0, 1]`.
/// Lexical / Vector mode: equals the underlying retriever's score
/// (no fusion ran). 0.0 for chunks dropped past `target_k`.
pub fusion_score: f32,
}
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, Serialize, Deserialize)]
pub struct TraceTiming {
pub lexical_ms: u64,
pub vector_ms: u64,
pub fusion_ms: u64,
pub total_ms: u64,
}
/// p9-fb-37: on-disk index size breakdown. Mirrored on the
/// wire `schema.v1.stats.index_bytes` block.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, Serialize, Deserialize)]
pub struct IndexBytes {
pub sqlite: u64,
pub lancedb: u64,
}
/// p9-fb-42: per-query result in bulk search. `response` XOR `error` —
/// exactly one is `Some`. `query` is the input echo (raw JSON value)
/// so consumers can correlate input to output without index tracking.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct BulkSearchItem {
pub query: serde_json::Value,
pub response: Option<serde_json::Value>,
pub error: Option<serde_json::Value>,
}
/// p9-fb-42: bulk summary counts. Invariant: total == succeeded + failed.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, Serialize, Deserialize)]
pub struct BulkSearchSummary {
pub total: u32,
pub succeeded: u32,
pub failed: u32,
}
/// p9-fb-42: MCP-only envelope. CLI emits raw ndjson without envelope.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct BulkSearchResponse {
pub schema_version: String,
pub results: Vec<BulkSearchItem>,
pub summary: BulkSearchSummary,
}
#[cfg(test)]
mod tests {
use super::*;
use time::macros::datetime;
#[test]
fn search_hit_serializes_indexed_at_and_stale() {
let hit = SearchHit {
rank: 1,
chunk_id: ChunkId("c".to_string()),
doc_id: DocumentId("d".to_string()),
doc_path: WorkspacePath::new("a/b.md".to_string()).unwrap(),
heading_path: vec!["H".to_string()],
section_label: None,
snippet: "s".to_string(),
citation: Citation::Line {
path: WorkspacePath::new("a/b.md".to_string()).unwrap(),
start: 1,
end: 1,
section: None,
},
retrieval: RetrievalDetail {
method: SearchMode::Lexical,
fusion_score: 0.5,
lexical_score: Some(0.5),
vector_score: None,
lexical_rank: Some(1),
vector_rank: None,
},
index_version: IndexVersion("v1".to_string()),
embedding_model: None,
chunker_version: ChunkerVersion("c1".to_string()),
indexed_at: datetime!(2026-05-09 12:00:00 UTC),
stale: true,
score_kind: ScoreKind::Rrf,
repo: None,
code_lang: None,
};
let v = serde_json::to_value(&hit).unwrap();
assert_eq!(v["indexed_at"], "2026-05-09T12:00:00Z");
assert_eq!(v["stale"], true);
}
#[test]
fn search_opts_default_is_all_none() {
let opts = SearchOpts::default();
assert!(opts.max_tokens.is_none());
assert!(opts.snippet_chars.is_none());
assert!(opts.cursor.is_none());
}
#[test]
fn search_filters_default_includes_new_fb36_fields() {
let f = SearchFilters::default();
assert!(f.media.is_empty(), "media default empty");
assert!(f.ingested_after.is_none(), "ingested_after default None");
assert!(f.doc_id.is_none(), "doc_id default None");
assert!(f.tags_any.is_empty());
assert!(f.lang.is_none());
assert!(f.path_glob.is_none());
assert!(f.trust_min.is_none());
}
#[test]
fn search_filters_serialize_with_serde_default_compat() {
let old: SearchFilters = serde_json::from_str(r#"{"tags_any":[],"lang":null,"path_glob":null,"trust_min":null}"#).unwrap();
assert!(old.media.is_empty());
assert!(old.ingested_after.is_none());
assert!(old.doc_id.is_none());
}
#[test]
fn search_trace_serde_roundtrip() {
let t = SearchTrace {
lexical: vec![TraceCandidate {
chunk_id: ChunkId("c1".into()),
doc_id: DocumentId("d1".into()),
doc_path: WorkspacePath::new("a.md".into()).unwrap(),
rank: 1,
score: 0.42,
}],
vector: vec![],
rrf_inputs: vec![TraceFusionInput {
chunk_id: ChunkId("c1".into()),
lexical_rank: Some(1),
vector_rank: None,
fusion_score: 0.0234,
}],
timing: TraceTiming {
lexical_ms: 12,
vector_ms: 0,
fusion_ms: 1,
total_ms: 14,
},
};
let v = serde_json::to_value(&t).unwrap();
assert_eq!(v["timing"]["lexical_ms"], 12);
assert_eq!(
v["lexical"][0]["score"].as_f64().unwrap() as f32,
0.42_f32
);
let back: SearchTrace = serde_json::from_value(v).unwrap();
assert_eq!(back, t);
}
#[test]
fn index_bytes_default_is_zero() {
let b = IndexBytes::default();
assert_eq!(b.sqlite, 0);
assert_eq!(b.lancedb, 0);
}
#[test]
fn search_opts_trace_default_false() {
let opts = SearchOpts::default();
assert!(!opts.trace);
}
#[test]
fn score_kind_serde_roundtrip() {
use ScoreKind::*;
for (kind, expected) in [(Rrf, "rrf"), (Bm25, "bm25"), (Cosine, "cosine")] {
let v = serde_json::to_value(kind).unwrap();
assert_eq!(v.as_str(), Some(expected));
let back: ScoreKind = serde_json::from_value(v).unwrap();
assert_eq!(back, kind);
}
}
#[test]
fn score_kind_default_is_rrf() {
assert_eq!(ScoreKind::default(), ScoreKind::Rrf);
}
#[test]
fn search_hit_deserialize_without_score_kind_defaults_to_rrf() {
let json = serde_json::json!({
"rank": 1,
"chunk_id": "c1",
"doc_id": "d1",
"doc_path": "a.md",
"heading_path": [],
"section_label": null,
"snippet": "x",
"citation": { "kind": "line", "path": "a.md", "start": 1, "end": 1, "section": null },
"retrieval": {
"method": "lexical",
"fusion_score": 0.5,
"lexical_score": 0.5,
"vector_score": null,
"lexical_rank": 1,
"vector_rank": null
},
"index_version": "v1",
"embedding_model": null,
"chunker_version": "c1",
"indexed_at": "2026-05-10T12:00:00Z",
"stale": false
});
let hit: SearchHit = serde_json::from_value(json).unwrap();
assert_eq!(hit.score_kind, ScoreKind::Rrf);
}
#[test]
fn bulk_search_summary_serde_roundtrip() {
let s = BulkSearchSummary {
total: 5,
succeeded: 4,
failed: 1,
};
let v = serde_json::to_value(s).unwrap();
assert_eq!(v["total"], 5);
assert_eq!(v["succeeded"], 4);
assert_eq!(v["failed"], 1);
let back: BulkSearchSummary = serde_json::from_value(v).unwrap();
assert_eq!(back, s);
}
#[test]
fn bulk_search_summary_default_is_zeros() {
let s = BulkSearchSummary::default();
assert_eq!(s.total, 0);
assert_eq!(s.succeeded, 0);
assert_eq!(s.failed, 0);
}
#[test]
fn bulk_search_item_serde_response_variant() {
let item = BulkSearchItem {
query: serde_json::json!({"query": "rust"}),
response: Some(serde_json::json!({"hits": []})),
error: None,
};
let v = serde_json::to_value(&item).unwrap();
assert!(v["response"].is_object());
assert!(v["error"].is_null());
}
#[test]
fn bulk_search_item_serde_error_variant() {
let item = BulkSearchItem {
query: serde_json::json!({"query": "rust"}),
response: None,
error: Some(serde_json::json!({"code": "config_invalid", "message": "bad"})),
};
let v = serde_json::to_value(&item).unwrap();
assert!(v["response"].is_null());
assert_eq!(v["error"]["code"], "config_invalid");
}
#[test]
fn search_hit_repo_and_code_lang_are_optional_and_omit_when_none() {
let hit = SearchHit {
rank: 1,
chunk_id: ChunkId("c1".into()),
doc_id: DocumentId("d1".into()),
doc_path: WorkspacePath("a.md".into()),
heading_path: vec![],
section_label: None,
snippet: "".into(),
citation: Citation::Line {
path: WorkspacePath("a.md".into()),
start: 1,
end: 2,
section: None,
},
retrieval: RetrievalDetail::default(),
index_version: IndexVersion("v1".into()),
embedding_model: None,
chunker_version: ChunkerVersion("md-heading-v1".into()),
indexed_at: time::OffsetDateTime::UNIX_EPOCH,
stale: false,
score_kind: ScoreKind::Rrf,
repo: None,
code_lang: None,
};
let v = serde_json::to_value(&hit).unwrap();
assert!(v.get("repo").is_none(), "repo should be omitted when None");
assert!(v.get("code_lang").is_none(), "code_lang should be omitted when None");
}
#[test]
fn search_hit_repo_and_code_lang_present_when_some() {
let hit = SearchHit {
rank: 1,
chunk_id: ChunkId("c1".into()),
doc_id: DocumentId("d1".into()),
doc_path: WorkspacePath("a.rs".into()),
heading_path: vec![],
section_label: None,
snippet: "".into(),
citation: Citation::Code {
path: WorkspacePath("a.rs".into()),
line_start: 1,
line_end: 2,
symbol: None,
lang: Some("rust".into()),
},
retrieval: RetrievalDetail::default(),
index_version: IndexVersion("v1".into()),
embedding_model: None,
chunker_version: ChunkerVersion("code-rust-ast-v1".into()),
indexed_at: time::OffsetDateTime::UNIX_EPOCH,
stale: false,
score_kind: ScoreKind::Rrf,
repo: Some("kebab".into()),
code_lang: Some("rust".into()),
};
let v = serde_json::to_value(&hit).unwrap();
assert_eq!(v["repo"], "kebab");
assert_eq!(v["code_lang"], "rust");
}
#[test]
fn search_filters_repo_and_code_lang_default_to_empty_vec() {
let f = SearchFilters::default();
assert!(f.repo.is_empty());
assert!(f.code_lang.is_empty());
}
}

View File

@@ -98,6 +98,11 @@ pub enum FinishReason {
Stop,
Length,
Aborted,
/// p9-fb-33: caller-side cancel. The pipeline breaks the LM loop
/// when a `Token` send into `AskOpts.stream_sink` returns
/// `SendError` (receiver dropped). The persisted answer is
/// flagged with `RefusalReason::LlmStreamAborted`.
Cancelled,
Error(String),
}

View File

@@ -5,14 +5,14 @@ edition = { workspace = true }
rust-version = { workspace = true }
license = { workspace = true }
repository = { workspace = true }
description = "Local fastembed-rs adapter implementing kb_core::Embedder (multilingual-e5-small default)"
description = "Local fastembed-rs adapter implementing kb_core::Embedder (multilingual-e5-large default, e5-small backwards-compat)"
[dependencies]
kebab-config = { path = "../kebab-config" }
kebab-embed = { path = "../kebab-embed" }
# Default features bring `ort-download-binaries` (bundled ONNX runtime)
# and `hf-hub-native-tls` (first-run model download). No extra features
# needed for the multilingual-e5-small path.
# needed for the multilingual-e5-{small,large} paths.
fastembed = { workspace = true }
tracing = { workspace = true }
anyhow = { workspace = true }

View File

@@ -1,8 +1,9 @@
//! `kb-embed-local` — `FastembedEmbedder`, a local ONNX-backed
//! [`Embedder`](kebab_embed::Embedder) implementation.
//!
//! Wraps [`fastembed::TextEmbedding`] for the default `multilingual-e5-small`
//! (384-dim) model. Honors `config.models.embedding.batch_size` and applies
//! Wraps [`fastembed::TextEmbedding`]. Default is `multilingual-e5-large`
//! (1024-dim, p9-fb-39b); `multilingual-e5-small` (384-dim) is also supported
//! for backwards-compat. Honors `config.models.embedding.batch_size` and applies
//! the e5 prefix convention (§11.3 of the design report):
//!
//! * `EmbeddingKind::Document` → `"passage: "` prefix
@@ -69,9 +70,9 @@ impl FastembedEmbedder {
.with_context(|| format!("create fastembed cache dir {}", cache_dir.display()))?;
// 2. Resolve the fastembed enum variant from
// `config.models.embedding.model`. Currently only the default
// `multilingual-e5-small` is wired; other model names error
// out with a clear message rather than silently misconfiguring.
// `config.models.embedding.model`. Currently `multilingual-e5-large`
// (default) and `multilingual-e5-small` are wired; other model names
// error out with a clear message rather than silently misconfiguring.
let model_name = resolve_model(&config.models.embedding.model)?;
// 3. Verify dim match BEFORE loading the model — if the config
@@ -100,7 +101,7 @@ impl FastembedEmbedder {
target: "kebab-embed-local",
model = %config.models.embedding.model,
cache_dir = %cache_dir.display(),
"loading embedding model (first run will download ~470MB)"
"loading embedding model (first run downloads model weights — ~470MB for e5-small, ~1.3GB for e5-large)"
);
let inner = TextEmbedding::try_new(opts)
.context("fastembed: TextEmbedding::try_new")?;
@@ -193,17 +194,18 @@ fn prefix_input(input: &EmbeddingInput<'_>) -> String {
}
/// Resolve a `config.models.embedding.model` string to a fastembed
/// `EmbeddingModel` enum variant. Only `multilingual-e5-small` is wired
/// for p3-2; additional model names should be added (and their dims
/// pinned in tests) as needed.
/// `EmbeddingModel` enum variant. Currently supports `multilingual-e5-small`
/// (384-dim) and `multilingual-e5-large` (1024-dim); additional model names
/// should be added (and their dims pinned in tests) as needed.
fn resolve_model(name: &str) -> Result<EmbeddingModel> {
match name {
"multilingual-e5-small" => Ok(EmbeddingModel::MultilingualE5Small),
"multilingual-e5-large" => Ok(EmbeddingModel::MultilingualE5Large),
other => anyhow::bail!(
"kb-embed-local: unsupported embedding model {other:?}; \
this adapter currently only ships `multilingual-e5-small`. \
Add a new arm to `resolve_model` (and a fastembed feature \
flag if needed) to support more."
this adapter currently ships `multilingual-e5-small` and \
`multilingual-e5-large`. Add a new arm to `resolve_model` \
(and a fastembed feature flag if needed) to support more."
),
}
}
@@ -294,6 +296,12 @@ mod tests {
resolve_model("multilingual-e5-small").expect("default model resolves");
}
#[test]
fn resolve_model_supports_e5_large() {
let m = resolve_model("multilingual-e5-large").expect("e5-large should resolve");
let _ = m;
}
#[test]
fn resolve_unknown_model_errors() {
let err = resolve_model("not-a-real-model").expect_err("unknown model errors");
@@ -301,6 +309,21 @@ mod tests {
assert!(msg.contains("unsupported embedding model"), "msg={msg}");
}
// ── check_dim ────────────────────────────────────────────────────
#[test]
fn check_dim_passes_for_1024() {
check_dim(1024, 1024).expect("matching dims must pass");
}
#[test]
fn check_dim_rejects_384_vs_1024() {
let err = check_dim(384, 1024).expect_err("dim mismatch must error");
let msg = format!("{err}");
assert!(msg.contains("384") && msg.contains("1024"),
"error must mention both dims, got: {msg}");
}
// expand_path tests live in `kb-config::paths`. The adapter imports
// it and trusts the upstream coverage rather than duplicating it.
}

View File

@@ -3,10 +3,11 @@
//!
//! ## Why every test in this file is `#[ignore]`
//!
//! The first call to `FastembedEmbedder::new` downloads ~470 MB of
//! weights from Hugging Face into `data_dir/models/fastembed/`. Doing
//! that on every `cargo test` invocation is wasteful, so the bare
//! invocation skips this file entirely.
//! The first call to `FastembedEmbedder::new` downloads ~1.3 GB of
//! weights (multilingual-e5-large per p9-fb-39b default) from Hugging
//! Face into `data_dir/models/fastembed/`. Doing that on every
//! `cargo test` invocation is wasteful, so the bare invocation skips
//! this file entirely.
//!
//! Run the full suite with:
//! ```text
@@ -58,19 +59,20 @@ fn shared_embedder() -> &'static FastembedEmbedder {
// ─── construction ─────────────────────────────────────────────────────
#[test]
#[ignore = "downloads ~470MB ONNX model on first run; CI-only"]
fn default_config_constructs_with_dims_384() {
#[ignore = "downloads ~1.3GB ONNX model on first run; CI-only"]
fn default_config_constructs_with_dims_1024() {
// p9-fb-39b: default flipped to multilingual-e5-large (1024 dim).
let emb = shared_embedder();
assert_eq!(emb.dimensions(), 384);
assert_eq!(emb.model_id().0, "multilingual-e5-small");
assert_eq!(emb.dimensions(), 1024);
assert_eq!(emb.model_id().0, "multilingual-e5-large");
assert_eq!(emb.model_version().0, "v1");
}
#[test]
#[ignore = "downloads ~470MB ONNX model on first run; CI-only"]
#[ignore = "downloads ~1.3GB ONNX model on first run; CI-only"]
fn mismatched_dims_in_config_errors_at_construction() {
let (mut cfg, _tmp) = test_config();
cfg.models.embedding.dimensions = 512; // model is 384
cfg.models.embedding.dimensions = 512; // model is 1024 (e5-large default)
// `FastembedEmbedder` deliberately does not implement `Debug`
// (its inner ONNX session has no useful debug shape), so we
// can't use `expect_err`; match the Result manually.
@@ -80,7 +82,7 @@ fn mismatched_dims_in_config_errors_at_construction() {
};
let msg = format!("{err}");
assert!(msg.contains("dimension mismatch"), "msg={msg}");
assert!(msg.contains("384"), "msg={msg}");
assert!(msg.contains("1024"), "msg={msg}");
assert!(msg.contains("512"), "msg={msg}");
}
@@ -104,8 +106,8 @@ fn document_and_query_yield_different_vectors() {
])
.expect("embed two inputs");
assert_eq!(out.len(), 2);
assert_eq!(out[0].len(), 384);
assert_eq!(out[1].len(), 384);
assert_eq!(out[0].len(), 1024);
assert_eq!(out[1].len(), 1024);
// Both vectors are L2-normalized → cosine similarity == dot product.
let cos: f32 = out[0]
@@ -142,11 +144,11 @@ fn output_vectors_are_l2_normalized() {
];
let out = emb.embed(&inputs).expect("embed");
// Per `kebab_embed::assert_unit_norm` docs: `5e-4` is the safe bound at
// 384 dims (f32::EPSILON ×384 ≈ 2.3e-6, but ONNX kernels add
// 1024 dims (f32::EPSILON ×1024 ≈ 2.3e-6, but ONNX kernels add
// their own per-component noise; 1e-3 is very generous and matches
// the spec's `± 1e-3`).
kebab_embed::assert_unit_norm(&out, 1e-3);
kebab_embed::assert_vector_shape(&out, 384);
kebab_embed::assert_vector_shape(&out, 1024);
}
// ─── determinism ──────────────────────────────────────────────────────
@@ -254,7 +256,7 @@ fn snapshot_aggregate_hash_is_stable() {
// Round every component to 4 decimal places, hash deterministically.
let mut hasher = DefaultHasher::new();
for (i, v) in out.iter().enumerate() {
assert_eq!(v.len(), 384, "row {i} dim mismatch");
assert_eq!(v.len(), 1024, "row {i} dim mismatch");
for x in v {
let rounded: i32 = (*x * 1.0e4).round() as i32;
rounded.hash(&mut hasher);

View File

@@ -184,6 +184,18 @@ pub fn render_report_md(report: &CompareReport) -> String {
),
);
}
for k in crate::metrics::TOP_K_VARIANTS {
let _ = writeln!(
out,
"| precision@{k}_chunk | {} | {} | {} |",
fmt(a.precision_at_k_chunk.get(k).copied().unwrap_or(f32::NAN)),
fmt(b.precision_at_k_chunk.get(k).copied().unwrap_or(f32::NAN)),
fmt_delta(
a.precision_at_k_chunk.get(k).copied().unwrap_or(f32::NAN),
b.precision_at_k_chunk.get(k).copied().unwrap_or(f32::NAN),
),
);
}
let _ = writeln!(
out,
"| citation_coverage | {} | {} | {} |",
@@ -419,6 +431,7 @@ fn build_deltas(
}
let mut hit = serde_json::Map::new();
let mut recall = serde_json::Map::new();
let mut precision = serde_json::Map::new();
for k in crate::metrics::TOP_K_VARIANTS {
hit.insert(
k.to_string(),
@@ -434,11 +447,19 @@ fn build_deltas(
b.recall_at_k_doc.get(k).copied().unwrap_or(f32::NAN),
),
);
precision.insert(
k.to_string(),
d(
a.precision_at_k_chunk.get(k).copied().unwrap_or(f32::NAN),
b.precision_at_k_chunk.get(k).copied().unwrap_or(f32::NAN),
),
);
}
serde_json::json!({
"hit_at_k": hit,
"mrr": d(a.mrr, b.mrr),
"recall_at_k_doc": recall,
"precision_at_k_chunk": precision,
"citation_coverage": d(a.citation_coverage, b.citation_coverage),
"groundedness": d(a.groundedness, b.groundedness),
"empty_result_rate": d(a.empty_result_rate, b.empty_result_rate),
@@ -484,6 +505,7 @@ mod tests {
hit_at_k: Default::default(),
mrr: 0.5,
recall_at_k_doc: Default::default(),
precision_at_k_chunk: Default::default(),
citation_coverage: f32::NAN,
groundedness: 0.0,
empty_result_rate: 0.0,

View File

@@ -58,6 +58,14 @@ pub struct AggregateMetrics {
pub hit_at_k: BTreeMap<u32, f32>,
pub mrr: f32,
pub recall_at_k_doc: BTreeMap<u32, f32>,
/// p9-fb-39: chunk-level precision at k. Binary relevance via
/// `expected_chunk_ids` (a hit is "relevant" if its chunk_id is
/// in the golden's `expected_chunk_ids`). Denominator is k (fixed)
/// — `hits.len() < k` still divides by k, treating shortfall as
/// precision loss (mirrors `hit_at_k`). Queries with empty
/// `expected_chunk_ids` are skipped (mirrors `hit_at_k_chunk`).
#[serde(default)]
pub precision_at_k_chunk: BTreeMap<u32, f32>,
#[serde(
serialize_with = "serialize_f32_nan_as_null",
deserialize_with = "deserialize_f32_or_nan"
@@ -187,6 +195,8 @@ pub(crate) fn aggregate_from_rows(
TOP_K_VARIANTS.iter().map(|k| (*k, (0_u32, 0_u32))).collect();
let mut recall_at_k_doc: BTreeMap<u32, (f64, u32)> =
TOP_K_VARIANTS.iter().map(|k| (*k, (0.0_f64, 0_u32))).collect();
let mut precision_at_k_chunk: BTreeMap<u32, (f64, u32)> =
TOP_K_VARIANTS.iter().map(|k| (*k, (0.0_f64, 0_u32))).collect();
let mut mrr_sum: f64 = 0.0;
let mut mrr_denom: u32 = 0;
@@ -243,6 +253,18 @@ pub(crate) fn aggregate_from_rows(
{
mrr_sum += 1.0 / f64::from(rank);
}
// p9-fb-39: precision@k_chunk — count of top-k hits whose
// chunk_id is in `expected`, divided by k (fixed denominator).
for k in TOP_K_VARIANTS {
let hits_in_topk_relevant = qr
.hits_top_k
.iter()
.filter(|h| h.rank <= *k && expected.contains(&h.chunk_id))
.count();
let entry = precision_at_k_chunk.get_mut(k).expect("init");
entry.0 += hits_in_topk_relevant as f64 / f64::from(*k);
entry.1 += 1;
}
}
// recall@k_doc (doc-level, requires non-empty expected_doc_ids
@@ -316,7 +338,8 @@ pub(crate) fn aggregate_from_rows(
| Citation::Page { path, .. }
| Citation::Region { path, .. }
| Citation::Caption { path, .. }
| Citation::Time { path, .. } => !path.0.is_empty(),
| Citation::Time { path, .. }
| Citation::Code { path, .. } => !path.0.is_empty(),
});
if covered {
citation_num += 1;
@@ -333,6 +356,7 @@ pub(crate) fn aggregate_from_rows(
mrr_sum / f64::from(mrr_denom)
}),
recall_at_k_doc: round_recall_map(&recall_at_k_doc),
precision_at_k_chunk: round_recall_map(&precision_at_k_chunk),
citation_coverage: ratio_or_nan(citation_num, citation_denom),
groundedness: ratio_or_zero(groundedness_num, groundedness_denom),
empty_result_rate: ratio_or_zero(empty_result_count, total_queries),
@@ -444,6 +468,13 @@ mod tests {
index_version: IndexVersion(format!("idx@{rank}")),
embedding_model: None,
chunker_version: ChunkerVersion("test@1".into()),
// fb-32: synthetic eval fixtures don't exercise staleness;
// pin UNIX_EPOCH + stale=false so hits stay deterministic.
indexed_at: OffsetDateTime::UNIX_EPOCH,
stale: false,
score_kind: kebab_core::ScoreKind::Rrf,
repo: None,
code_lang: None,
}
}
@@ -479,6 +510,9 @@ mod tests {
end: 1,
section: None,
},
// fb-32: synthetic eval citations don't exercise staleness.
indexed_at: OffsetDateTime::UNIX_EPOCH,
stale: false,
}).collect(),
grounded,
refusal_reason: None,
@@ -666,4 +700,114 @@ mod tests {
assert_eq!(agg.failed_queries, 1);
assert_eq!(agg.total_queries, 1);
}
#[test]
fn precision_at_k_chunk_field_default_empty_on_old_json() {
// Old eval_runs.metrics_json predates fb-39 — no precision_at_k_chunk field.
// serde(default) yields empty BTreeMap.
let old = serde_json::json!({
"hit_at_k": {"1": 0.5, "3": 0.5, "5": 0.5, "10": 0.5},
"mrr": 0.5,
"recall_at_k_doc": {"1": 0.0, "3": 0.0, "5": 0.0, "10": 0.0},
"citation_coverage": null,
"groundedness": 0.0,
"empty_result_rate": 0.0,
"refusal_correctness": null,
"total_queries": 1,
"failed_queries": 0
});
let parsed: AggregateMetrics =
serde_json::from_value(old).expect("backwards-compat deserialize");
assert!(parsed.precision_at_k_chunk.is_empty());
}
#[test]
fn precision_at_k_chunk_exact_match() {
// expected = [c1, c2, c3]. Top-5 hits: [c1@1, c2@2, c3@3, x@4, y@5].
// P@5 = 3/5 = 0.6. P@10 = 3/10 = 0.3.
let queries = vec![gq("q1", &["c1", "c2", "c3"], &["d1"])];
let rows = vec![record(
"q1",
vec![
hit(1, "c1", "d1"),
hit(2, "c2", "d1"),
hit(3, "c3", "d1"),
hit(4, "x", "d1"),
hit(5, "y", "d1"),
],
None,
None,
)];
let agg = aggregate_from_rows(&queries, &rows).unwrap();
assert_eq!(agg.precision_at_k_chunk[&5], 0.6);
assert_eq!(agg.precision_at_k_chunk[&10], 0.3);
}
#[test]
fn precision_at_k_chunk_partial_topk_divides_by_k() {
// expected = [c1, c2]. Hits: only [c1@1, c2@2, x@3] (3 results).
// P@5 = 2/5 = 0.4 (denominator is k, not hits.len()).
let queries = vec![gq("q1", &["c1", "c2"], &["d1"])];
let rows = vec![record(
"q1",
vec![hit(1, "c1", "d1"), hit(2, "c2", "d1"), hit(3, "x", "d1")],
None,
None,
)];
let agg = aggregate_from_rows(&queries, &rows).unwrap();
assert_eq!(agg.precision_at_k_chunk[&5], 0.4);
assert_eq!(agg.precision_at_k_chunk[&10], 0.2);
}
#[test]
fn precision_at_k_chunk_zero_relevant_in_topk() {
// expected = [c1]. Hits: [x@1, y@2, z@3] (none relevant).
// P@5 = 0/5 = 0.0.
let queries = vec![gq("q1", &["c1"], &["d1"])];
let rows = vec![record(
"q1",
vec![hit(1, "x", "d1"), hit(2, "y", "d1"), hit(3, "z", "d1")],
None,
None,
)];
let agg = aggregate_from_rows(&queries, &rows).unwrap();
assert_eq!(agg.precision_at_k_chunk[&5], 0.0);
}
#[test]
fn precision_at_k_chunk_empty_expected_skipped() {
// expected_chunk_ids = []. Skipped → final BTreeMap entry value = 0.0
// (zero-denom path in round_recall_map). Mirrors recall_at_k_doc behavior.
let queries = vec![gq("q1", &[], &["d1"])];
let rows = vec![record("q1", vec![hit(1, "c1", "d1")], None, None)];
let agg = aggregate_from_rows(&queries, &rows).unwrap();
assert_eq!(agg.precision_at_k_chunk[&5], 0.0);
}
#[test]
fn precision_at_k_chunk_two_queries_averaged() {
// q1: expected=[c1], hits=[c1@1, x@2, y@3] → P@5 = 1/5 = 0.2
// q2: expected=[c1, c2], hits=[c1@1, c2@2] → P@5 = 2/5 = 0.4
// Avg P@5 = 0.3.
let queries = vec![
gq("q1", &["c1"], &["d1"]),
gq("q2", &["c1", "c2"], &["d2"]),
];
let rows = vec![
record(
"q1",
vec![hit(1, "c1", "d1"), hit(2, "x", "d1"), hit(3, "y", "d1")],
None,
None,
),
record(
"q2",
vec![hit(1, "c1", "d2"), hit(2, "c2", "d2")],
None,
None,
),
];
let agg = aggregate_from_rows(&queries, &rows).unwrap();
assert_eq!(agg.precision_at_k_chunk[&5], 0.3);
}
}

View File

@@ -11,6 +11,12 @@
"5": 0.666700005531311
},
"mrr": 0.41670000553131104,
"precision_at_k_chunk": {
"1": 0.33329999446868896,
"10": 0.06669999659061432,
"3": 0.11110000312328339,
"5": 0.13330000638961792
},
"recall_at_k_doc": {
"1": 0.33329999446868896,
"10": 0.666700005531311,
@@ -32,6 +38,12 @@
"5": 1.0
},
"mrr": 0.833299994468689,
"precision_at_k_chunk": {
"1": 0.666700005531311,
"10": 0.10000000149011612,
"3": 0.33329999446868896,
"5": 0.20000000298023224
},
"recall_at_k_doc": {
"1": 0.666700005531311,
"10": 1.0,
@@ -53,6 +65,12 @@
"5": 0.33329999446868896
},
"mrr": 0.41659998893737793,
"precision_at_k_chunk": {
"1": 0.33340001106262207,
"10": 0.0333000048995018,
"3": 0.22219999134540558,
"5": 0.06669999659061432
},
"recall_at_k_doc": {
"1": 0.33340001106262207,
"10": 0.33329999446868896,

View File

@@ -82,6 +82,13 @@ fn hit(rank: u32, chunk_id: &str, doc_id: &str) -> SearchHit {
index_version: IndexVersion("idx@1".into()),
embedding_model: None,
chunker_version: ChunkerVersion("test@1".into()),
// fb-32: synthetic eval fixtures don't exercise staleness;
// pin UNIX_EPOCH + stale=false so hits stay deterministic.
indexed_at: OffsetDateTime::UNIX_EPOCH,
stale: false,
score_kind: kebab_core::ScoreKind::Rrf,
repo: None,
code_lang: None,
}
}
@@ -198,6 +205,7 @@ fn store_aggregate_rejects_missing_run() {
hit_at_k: Default::default(),
mrr: 0.0,
recall_at_k_doc: Default::default(),
precision_at_k_chunk: Default::default(),
citation_coverage: f32::NAN,
groundedness: 0.0,
empty_result_rate: 0.0,

View File

@@ -213,7 +213,7 @@ fn runner_records_config_snapshot_with_versions() {
assert!(snap.pointer("/llm/model_id").is_some());
assert_eq!(
snap.pointer("/prompt_template_version"),
Some(&serde_json::Value::String("rag-v1".to_string())),
Some(&serde_json::Value::String("rag-v2".to_string())),
);
assert!(snap.pointer("/score_gate").is_some());
assert!(snap.pointer("/rrf_k").is_some());
@@ -336,21 +336,29 @@ fn runner_lexical_is_deterministic_per_query_payload() {
"- id: q1\n query: ownership\n- id: q2\n query: heading\n",
);
let run_a = run_with_golden(&yaml, || {
let mut run_a = run_with_golden(&yaml, || {
run_eval_with_config(&env.config, &lexical_opts()).unwrap()
});
let run_b = run_with_golden(&yaml, || {
let mut run_b = run_with_golden(&yaml, || {
run_eval_with_config(&env.config, &lexical_opts()).unwrap()
});
// Run-level fields (`run_id`, `created_at`) intentionally diverge;
// the per-query payload (which is what the snapshot fixture pins)
// must be byte-identical.
// must be byte-identical EXCEPT for `elapsed_ms`. Timing-sensitive
// fields aren't determinism signals — they're µs-scale wall-clock
// jitter and would otherwise make this assertion a flaky one (a 0
// vs 1 ms divergence was observed under contended-CI load). Normalize
// before comparing; see test #7 for the same exclusion done via a
// projection.
for qr in run_a.per_query.iter_mut().chain(run_b.per_query.iter_mut()) {
qr.elapsed_ms = 0;
}
let a_json = serde_json::to_string(&run_a.per_query).unwrap();
let b_json = serde_json::to_string(&run_b.per_query).unwrap();
assert_eq!(
a_json, b_json,
"lexical-only per_query payload must be byte-identical across runs"
"lexical-only per_query payload must be byte-identical across runs (timing normalized)"
);
}

View File

@@ -0,0 +1,29 @@
[package]
name = "kebab-mcp"
edition = { workspace = true }
rust-version = { workspace = true }
license = { workspace = true }
repository = { workspace = true }
version = { workspace = true }
[dependencies]
rmcp = { workspace = true }
# rt-multi-thread + io-util + io-std extend the workspace tokio entry
# (which only declares rt + macros) for the blocking stdio MCP transport.
tokio = { workspace = true, features = ["rt-multi-thread", "macros", "io-util", "io-std"] }
serde = { workspace = true }
serde_json = { workspace = true }
anyhow = { workspace = true }
tracing = { workspace = true }
# schemars 1.x matches rmcp 1.6's ^1.0 requirement (verified via crates.io
# /dependencies endpoint — rmcp declares optional schemars = "^1.0").
schemars = "1"
time = { workspace = true }
kebab-app = { path = "../kebab-app" }
kebab-config = { path = "../kebab-config" }
kebab-core = { path = "../kebab-core" }
[dev-dependencies]
tempfile = { workspace = true }

View File

@@ -0,0 +1,22 @@
//! Map `anyhow::Error` returned by kebab-app facades to MCP
//! `CallToolResult` with `isError: true` + error.v1 JSON content.
use rmcp::model::{CallToolResult, Content};
use kebab_app::classify;
/// Convert an `anyhow::Error` to a `CallToolResult` with `isError: true`
/// and the serialised `error.v1` envelope as the text content.
pub fn to_tool_error(err: &anyhow::Error) -> CallToolResult {
let v1 = classify(err, false);
let body = serde_json::to_string(&v1).unwrap_or_else(|_| {
r#"{"schema_version":"error.v1","code":"generic","message":"serialize failed"}"#
.to_string()
});
CallToolResult::error(vec![Content::text(body)])
}
/// Wrap a successful wire-schema JSON string as a `CallToolResult`.
pub fn to_tool_success(json: String) -> CallToolResult {
CallToolResult::success(vec![Content::text(json)])
}

213
crates/kebab-mcp/src/lib.rs Normal file
View File

@@ -0,0 +1,213 @@
//! MCP (Model Context Protocol) server over stdio. Exposes 8 tools
//! (`search` / `ask` / `schema` / `doctor` / `ingest_file` / `ingest_stdin`
//! / `fetch` / `bulk_search`) backed by `kebab-app` facade methods. Used by
//! `kebab-cli`'s `Cmd::Mcp` arm.
//!
//! See spec `docs/superpowers/specs/2026-05-07-p9-fb-30-mcp-server-design.md`.
use std::path::PathBuf;
use anyhow::Result;
use rmcp::ServerHandler;
use rmcp::handler::server::common::{schema_for_empty_input, schema_for_type};
use rmcp::model::{
CallToolRequestParams, CallToolResult, Implementation, ListToolsResult, ServerCapabilities,
ServerInfo, Tool,
};
use rmcp::service::{RequestContext, ServiceExt};
use rmcp::transport::stdio;
use rmcp::{ErrorData, RoleServer};
use kebab_config::Config;
pub mod error;
pub mod state;
pub mod tools;
pub use state::KebabAppState;
/// Build the canonical list of tools exposed by the MCP server.
///
/// Extracted from [`ServerHandler::list_tools`] so it can be called
/// directly in tests without constructing a `RequestContext`.
pub fn build_tools_vec() -> Vec<Tool> {
vec![
Tool::new(
"schema",
"Introspection — wire schemas, capabilities, model versions, index stats.",
schema_for_empty_input(),
),
Tool::new(
"doctor",
"Health check — verifies config, storage, models, and Ollama connectivity.",
schema_for_empty_input(),
),
Tool::new(
"search",
"Full-text / vector / hybrid search over the knowledge base. Returns search_hit.v1 array.",
schema_for_type::<tools::search::SearchInput>(),
),
Tool::new(
"ask",
"RAG question answering over the knowledge base. Returns answer.v1 JSON. Pass session_id for multi-turn context.",
schema_for_type::<tools::ask::AskInput>(),
),
Tool::new(
"ingest_file",
"Ingest a single file (path) into the knowledge base. Workspace external paths allowed — bytes are copied into _external/.",
schema_for_type::<tools::ingest_file::IngestFileInput>(),
),
Tool::new(
"ingest_stdin",
"Ingest markdown content into the knowledge base. v1 markdown only. Frontmatter (title + source_uri) auto-injected.",
schema_for_type::<tools::ingest_stdin::IngestStdinInput>(),
),
Tool::new(
"fetch",
"Verbatim fetch — chunk / doc / span modes. Returns fetch_result.v1 with the indexed text (no LLM rewrite).",
schema_for_type::<tools::fetch::FetchInput>(),
),
Tool::new(
"bulk_search",
"Bulk multi-query search — N queries per call (cap 100). Each query mirrors the `search` input shape; returns `bulk_search_response.v1` with per-query results + summary. Sequential execution reuses one App instance so cache / embedder cold-start cost amortizes.",
schema_for_type::<tools::bulk_search::BulkSearchInput>(),
),
]
}
#[derive(Clone)]
pub struct KebabHandler {
state: KebabAppState,
}
impl KebabHandler {
pub fn new(state: KebabAppState) -> Self {
Self { state }
}
pub fn state(&self) -> &KebabAppState {
&self.state
}
/// Spawn a tool handler on the blocking pool. Used by tools that
/// transitively touch reqwest::blocking::Client (search, ask) — calling
/// from the async dispatch directly panics inside the runtime.
async fn spawn_tool<I, F>(
&self,
args: serde_json::Map<String, serde_json::Value>,
handle: F,
) -> Result<CallToolResult, ErrorData>
where
I: serde::de::DeserializeOwned + Send + 'static,
F: FnOnce(KebabAppState, I) -> CallToolResult + Send + 'static,
{
let input: I = match serde_json::from_value(serde_json::Value::Object(args)) {
Ok(i) => i,
Err(e) => return Ok(error::to_tool_error(&anyhow::Error::from(e))),
};
let state = self.state.clone();
tokio::task::spawn_blocking(move || handle(state, input))
.await
.map_err(|e| ErrorData::internal_error(e.to_string(), None))
}
}
impl ServerHandler for KebabHandler {
fn get_info(&self) -> ServerInfo {
ServerInfo::new(ServerCapabilities::builder().enable_tools().build())
.with_server_info(Implementation::new("kebab", env!("CARGO_PKG_VERSION")))
}
async fn list_tools(
&self,
_request: Option<rmcp::model::PaginatedRequestParams>,
_context: RequestContext<RoleServer>,
) -> Result<ListToolsResult, ErrorData> {
Ok(ListToolsResult::with_all_items(build_tools_vec()))
}
async fn call_tool(
&self,
request: CallToolRequestParams,
_context: RequestContext<RoleServer>,
) -> Result<CallToolResult, ErrorData> {
match request.name.as_ref() {
"schema" => {
let input = tools::schema::SchemaInput::default();
Ok(tools::schema::handle(&self.state, input))
}
"doctor" => {
let input = tools::doctor::DoctorInput::default();
Ok(tools::doctor::handle(&self.state, input))
}
"search" => {
let args = request.arguments.unwrap_or_default();
self.spawn_tool(args, |state, input| {
tools::search::handle(&state, input)
})
.await
}
"ask" => {
let args = request.arguments.unwrap_or_default();
self.spawn_tool(args, |state, input| {
tools::ask::handle(&state, input)
})
.await
}
"ingest_file" => {
let args = request.arguments.unwrap_or_default();
self.spawn_tool(args, |state, input| {
tools::ingest_file::handle(&state, input)
})
.await
}
"ingest_stdin" => {
let args = request.arguments.unwrap_or_default();
self.spawn_tool(args, |state, input| {
tools::ingest_stdin::handle(&state, input)
})
.await
}
"fetch" => {
let args = request.arguments.unwrap_or_default();
self.spawn_tool(args, |state, input| {
tools::fetch::handle(&state, input)
})
.await
}
"bulk_search" => {
let args = request.arguments.unwrap_or_default();
self.spawn_tool(args, |state, input| {
tools::bulk_search::handle(&state, input)
})
.await
}
_other => Err(ErrorData::method_not_found::<
rmcp::model::CallToolRequestMethod,
>()),
}
}
}
/// Run the MCP server on stdio JSON-RPC. Blocks until the client closes
/// the stream (typically when the agent host exits).
///
/// `config_path` is the path passed via `--config <path>`, if any.
/// It is forwarded to `KebabAppState` so the doctor tool can honour the
/// same config file the server was started with (falls back to XDG default
/// when `None`).
pub fn serve_stdio(cfg: Config, config_path: Option<PathBuf>) -> Result<()> {
let runtime = tokio::runtime::Builder::new_multi_thread()
.enable_all()
.build()?;
runtime.block_on(serve_stdio_async(cfg, config_path))
}
async fn serve_stdio_async(cfg: Config, config_path: Option<PathBuf>) -> Result<()> {
tracing::info!("kebab-mcp: starting stdio server");
let state = KebabAppState::new(cfg, config_path);
let handler = KebabHandler::new(state);
let service = handler.serve(stdio()).await?;
service.waiting().await?;
Ok(())
}

View File

@@ -0,0 +1,26 @@
//! Long-lived server state — holds Config so per-request handlers don't
//! reload from disk. Future: cache opened SqliteStore / Lance handles
//! here so first tool call pays the cost, subsequent calls hit warm
//! state.
use std::path::PathBuf;
use std::sync::Arc;
use kebab_config::Config;
#[derive(Clone)]
pub struct KebabAppState {
pub config: Arc<Config>,
/// `--config <path>` from CLI when present, else `None` (XDG default
/// fallback applies in `doctor_with_config_path`).
pub config_path: Option<PathBuf>,
}
impl KebabAppState {
pub fn new(config: Config, config_path: Option<PathBuf>) -> Self {
Self {
config: Arc::new(config),
config_path,
}
}
}

View File

@@ -0,0 +1,68 @@
//! `ask` tool — wraps `kebab_app::ask_with_config` (single-shot) or
//! `kebab_app::ask_with_session_with_config` when `session_id` is provided.
//! Input: { query, session_id?, mode? }. Output: answer.v1 JSON.
//!
//! `Answer` (kebab-core) does NOT carry a `schema_version` field; we tag
//! it inline here, matching the pattern from `search.rs`.
use rmcp::model::CallToolResult;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use crate::error::{to_tool_error, to_tool_success};
use crate::state::KebabAppState;
#[derive(Debug, Deserialize, Serialize, JsonSchema)]
pub struct AskInput {
/// The user question.
pub query: String,
/// Optional session id for multi-turn RAG context.
pub session_id: Option<String>,
/// Optional retrieval mode override ("lexical" / "vector" / "hybrid"). Default "hybrid".
pub mode: Option<String>,
}
pub fn handle(state: &KebabAppState, input: AskInput) -> CallToolResult {
let mode = match input.mode.as_deref() {
Some("lexical") => kebab_core::SearchMode::Lexical,
Some("vector") => kebab_core::SearchMode::Vector,
_ => kebab_core::SearchMode::Hybrid, // default + "hybrid" + unknown
};
let opts = kebab_app::AskOpts {
k: 10,
explain: false,
mode,
temperature: None,
seed: None,
stream_sink: None,
history: Vec::new(),
conversation_id: None,
turn_index: None,
};
let cfg_clone = (*state.config).clone();
let result = match input.session_id {
Some(sid) => {
kebab_app::ask_with_session_with_config(cfg_clone, &sid, &input.query, opts)
}
None => kebab_app::ask_with_config(cfg_clone, &input.query, opts),
};
match result {
Ok(answer) => {
// `Answer` does not carry `schema_version`; tag inline (idempotent
// via entry().or_insert_with in case a future version adds it).
let mut v = match serde_json::to_value(&answer) {
Ok(v) => v,
Err(e) => return to_tool_error(&anyhow::anyhow!("answer serialize failed: {e}")),
};
if let serde_json::Value::Object(ref mut map) = v {
map.entry("schema_version".to_string())
.or_insert_with(|| serde_json::Value::String("answer.v1".to_string()));
}
match serde_json::to_string(&v) {
Ok(json) => to_tool_success(json),
Err(e) => to_tool_error(&anyhow::anyhow!(e)),
}
}
Err(e) => to_tool_error(&e),
}
}

View File

@@ -0,0 +1,55 @@
//! `bulk_search` tool — wraps `kebab_app::bulk_search_with_config`.
//! Input: `{ queries: [<SearchInput shape>, ...] }`.
//! Output: `bulk_search_response.v1` envelope (results + summary).
use rmcp::model::CallToolResult;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use crate::error::{to_tool_error, to_tool_success};
use crate::state::KebabAppState;
#[derive(Debug, Deserialize, Serialize, JsonSchema)]
pub struct BulkSearchInput {
/// Per-query inputs. Each item mirrors the single-query `search`
/// tool's input shape — `query` is required, all other fields are
/// optional and default to single-search defaults. Capped at 100
/// items; exceeding returns an `invalid_input` tool error without
/// running any query.
pub queries: Vec<serde_json::Value>,
}
pub fn handle(state: &KebabAppState, input: BulkSearchInput) -> CallToolResult {
let cfg_clone = (*state.config).clone();
match kebab_app::bulk_search_with_config(cfg_clone, input.queries) {
Ok((items, summary)) => {
let tagged_items: Vec<serde_json::Value> = items
.iter()
.map(|it| {
let mut v = serde_json::to_value(it).unwrap_or(serde_json::Value::Null);
if let serde_json::Value::Object(ref mut map) = v {
map.insert(
"schema_version".to_string(),
serde_json::Value::String("bulk_search_item.v1".to_string()),
);
}
v
})
.collect();
let envelope = serde_json::json!({
"schema_version": "bulk_search_response.v1",
"results": tagged_items,
"summary": {
"total": summary.total,
"succeeded": summary.succeeded,
"failed": summary.failed,
},
});
match serde_json::to_string(&envelope) {
Ok(json) => to_tool_success(json),
Err(e) => to_tool_error(&anyhow::anyhow!(e)),
}
}
Err(e) => to_tool_error(&e),
}
}

View File

@@ -0,0 +1,28 @@
//! `doctor` tool — wraps `kebab_app::doctor_with_config_path`.
//! Input: {} (no args). Output: doctor.v1 JSON.
//!
//! `doctor_with_config_path(Option<&Path>)` re-reads config from disk so
//! the report reflects the live file state. We forward `config_path` from
//! `KebabAppState` so `--config <path>` users see results for their file;
//! callers that pass `None` fall back to the XDG default (same as the CLI
//! bare `kebab doctor`).
use rmcp::model::CallToolResult;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use crate::error::{to_tool_error, to_tool_success};
use crate::state::KebabAppState;
#[derive(Debug, Default, Deserialize, Serialize, JsonSchema)]
pub struct DoctorInput {}
pub fn handle(state: &KebabAppState, _input: DoctorInput) -> CallToolResult {
match kebab_app::doctor_with_config_path(state.config_path.as_deref()) {
Ok(report) => match serde_json::to_string(&report) {
Ok(json) => to_tool_success(json),
Err(e) => to_tool_error(&anyhow::anyhow!(e)),
},
Err(e) => to_tool_error(&e),
}
}

View File

@@ -0,0 +1,99 @@
//! p9-fb-35 `fetch` tool — wraps `kebab_app::fetch_with_config`.
//!
//! Three modes (chunk / doc / span). Output is `fetch_result.v1`.
//!
//! Mirrors the CLI surface (`kebab fetch <kind> ...`): same input shape,
//! same wire envelope. Missing kind-specific fields produce an `error.v1`
//! with `code = "invalid_input"`.
use rmcp::model::CallToolResult;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use crate::error::{to_tool_error, to_tool_success};
use crate::state::KebabAppState;
#[derive(Debug, Deserialize, Serialize, JsonSchema)]
pub struct FetchInput {
/// "chunk" | "doc" | "span"
pub kind: String,
/// Required when kind = "chunk".
pub chunk_id: Option<String>,
/// Required when kind = "doc" or "span".
pub doc_id: Option<String>,
/// Required when kind = "span" (1-based, inclusive).
pub line_start: Option<u32>,
pub line_end: Option<u32>,
/// chunk only: ±N surrounding chunks.
pub context: Option<u32>,
/// doc/span only: chars/4 budget.
pub max_tokens: Option<usize>,
}
pub fn handle(state: &KebabAppState, input: FetchInput) -> CallToolResult {
let query = match input.kind.as_str() {
"chunk" => match input.chunk_id {
Some(id) => kebab_core::FetchQuery::Chunk(kebab_core::ChunkId(id)),
None => return invalid_input("kind=chunk requires chunk_id"),
},
"doc" => match input.doc_id {
Some(id) => kebab_core::FetchQuery::Doc(kebab_core::DocumentId(id)),
None => return invalid_input("kind=doc requires doc_id"),
},
"span" => match (input.doc_id, input.line_start, input.line_end) {
(Some(id), Some(start), Some(end)) => kebab_core::FetchQuery::Span {
doc_id: kebab_core::DocumentId(id),
line_start: start,
line_end: end,
},
_ => return invalid_input("kind=span requires doc_id, line_start, line_end"),
},
other => {
return invalid_input(&format!(
"unknown kind '{other}'; expected chunk|doc|span"
));
}
};
let opts = kebab_core::FetchOpts {
context: input.context,
max_tokens: input.max_tokens,
};
let cfg_clone = (*state.config).clone();
match kebab_app::fetch_with_config(cfg_clone, query, opts) {
Ok(r) => {
// FetchResult does not carry a `schema_version` field, so we
// tag the envelope inline (mirrors search.rs's pattern).
let mut v = match serde_json::to_value(&r) {
Ok(v) => v,
Err(e) => {
return to_tool_error(&anyhow::anyhow!("FetchResult serialize: {e}"));
}
};
if let serde_json::Value::Object(ref mut map) = v {
map.insert(
"schema_version".to_string(),
serde_json::Value::String("fetch_result.v1".to_string()),
);
}
match serde_json::to_string(&v) {
Ok(json) => to_tool_success(json),
Err(e) => to_tool_error(&anyhow::anyhow!(e)),
}
}
Err(e) => to_tool_error(&e),
}
}
fn invalid_input(msg: &str) -> CallToolResult {
use kebab_app::{ErrorV1, StructuredError};
let err = anyhow::Error::new(StructuredError(ErrorV1 {
schema_version: "error.v1".to_string(),
code: "invalid_input".to_string(),
message: msg.to_string(),
details: serde_json::Value::Null,
hint: None,
}));
to_tool_error(&err)
}

View File

@@ -0,0 +1,39 @@
//! `ingest_file` tool — wraps `kebab_app::ingest_file_with_config`.
//! Input: { path }. Output: ingest_report.v1 JSON.
use std::path::PathBuf;
use rmcp::model::CallToolResult;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
use crate::error::{to_tool_error, to_tool_success};
use crate::state::KebabAppState;
#[derive(Debug, Deserialize, Serialize, JsonSchema)]
pub struct IngestFileInput {
/// Absolute or relative path to the file to ingest. Workspace external
/// paths are allowed — bytes are copied into `_external/`.
pub path: String,
}
pub fn handle(state: &KebabAppState, input: IngestFileInput) -> CallToolResult {
let cfg_clone = (*state.config).clone();
let path = PathBuf::from(input.path);
match kebab_app::ingest_file_with_config(cfg_clone, &path) {
Ok(report) => match serde_json::to_value(&report) {
Ok(mut v) => {
if let serde_json::Value::Object(ref mut map) = v {
map.entry("schema_version".to_string())
.or_insert_with(|| serde_json::Value::String("ingest_report.v1".to_string()));
}
match serde_json::to_string(&v) {
Ok(json) => to_tool_success(json),
Err(e) => to_tool_error(&anyhow::anyhow!(e)),
}
}
Err(e) => to_tool_error(&anyhow::anyhow!(e)),
},
Err(e) => to_tool_error(&e),
}
}

Some files were not shown because too many files have changed in this diff Show More