Add `SqliteStore::code_lang_breakdown()` that queries
`json_extract(metadata_json, '$.code_lang')`, groups by it, and
skips NULL rows — returning `BTreeMap<String, u32>`.
Wire it into `collect_stats` in `kebab-app::schema`, replacing the
`BTreeMap::new()` placeholder inserted by 1A-1.
Test: `store::tests::code_lang_breakdown_counts_by_code_lang` asserts
rust=1 and that a null-code_lang doc does NOT appear in the map.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `MediaType::Code("rust")` dispatch arm in `ingest_one_asset`,
`ingest_one_code_asset` fn (faithful mirror of `ingest_one_pdf_asset`),
and `backfill_code_lang` post-processing in `App::search_uncached`.
Integration test `code_ingest_smoke.rs` verifies full pipeline:
ingest `.rs` → Citation::Code hit with lang/symbol/line_start.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TDD: red → green cycle confirmed. New `Code(String)` variant serializes
as `{"code":"rust"}` via serde `rename_all = "lowercase"`. All exhaustive
`match` sites updated (`media_label`, `ingest_one_asset` catch-all →
explicit or-pattern). Design §3.5 enum listing synced. Also fix
`/target` symlink gitignore pattern so integration-test binary lookup
via workspace-relative path works with CARGO_TARGET_DIR redirect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task 13: add wire regression tests proving markdown SearchHit omits
repo/code_lang when None, and all 5 original Citation variants serialize
byte-identically without spurious Code-variant keys.
Task 15: add --repo (repeatable) and --code-lang (repeatable,
comma-separated) flags to `kebab search`; propagate both into
SearchFilters instead of the previous vec![] stub. Add
#[allow(clippy::large_enum_variant)] — Cmd is short-lived, boxing buys
nothing.
Task 16: add code_lang_breakdown and repo_breakdown BTreeMap fields to
Stats (schema.v1); derive Default on Stats; populate both as empty in
collect_stats (1A-2 fills them when code chunks land). Add unit test
asserting both keys are present in the serialized object.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire kebab_parse_code::is_generated_file and is_oversized into
FsSourceConnector::scan_with_skips. Files that pass gitignore/builtin/
kebabignore matching are now checked for generated-file markers
(config-gated via ingest.code.skip_generated_header) and byte/line caps
(ingest.code.max_file_bytes / max_file_lines). FsScanSkips gains
skipped_generated + skipped_size_exceeded counters; kebab-app threads
them into IngestReport. Also fixes a pre-existing clippy::derivable_impls
warning in IngestCfg. Three new connector tests cover all three paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Refactor walker to expose WalkOverrides (combined + per-source matchers),
add walk_files_with_skips that returns accepted files alongside skip
attribution, wire FsSourceConnector::scan_with_skips into kebab-app so
IngestReport.skipped_gitignore, skipped_kebabignore, skipped_builtin_blacklist,
and skip_examples are populated instead of left at zero. Priority order
per spec §5.2 (builtin > gitignore > kebabignore) enforced in classify_skip,
with a directory-aware builtin matcher so pruned directory entries are
correctly attributed to builtin rather than a coincident gitignore entry.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds five new u32 counters (skipped_gitignore, skipped_kebabignore,
skipped_builtin_blacklist, skipped_generated, skipped_size_exceeded)
and a SkipExamples struct (≤5 sample paths per category) to
IngestReport. All new fields are #[serde(default)] for backward-compat
deserialization. Downstream literal construction sites patched with
zeros/empty; snapshot re-baked.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire two new optional fields onto SearchHit (skip_serializing_if = None)
and two Vec<String> filter fields onto SearchFilters (serde default).
Add RetrievalDetail::Default impl (manual, uses SearchMode::Hybrid as
sentinel). Patch all downstream SearchHit / SearchFilters literal
constructors with repo: None / code_lang: None / vec![] as appropriate.
Also covers Citation::Code arm in kebab-eval metrics match.
Also fixes App::search_with_opts trace branch to use NoopRetriever
for SearchMode::Lexical, removing the embeddings requirement when
the user only wants lexical-mode trace.
Adds the `trace: Option<SearchTrace>` field to `SearchResponse` and
threads `SearchOpts.trace` through `App::search_with_opts`. When the
caller sets `opts.trace = true` the path bypasses the LRU search cache
and runs through `HybridRetriever::search_with_trace`, which dispatches
all 3 SearchModes internally; this means `--trace` requires embeddings
(same constraint as `--mode hybrid`). The non-trace path keeps its
exact prior behavior with `trace: None` stamped on the response.
Picked up Task 1 / Task 3 follow-ups in the same commit so the
workspace compiles: SearchOpts struct-literals in kebab-cli/main.rs +
kebab-mcp/tools/search.rs default the new `trace` field to false, and
the schema-wrapper test in kebab-cli/wire.rs fills the new
media_breakdown / lang_breakdown / index_bytes / stale_doc_count fields
on Stats with `Default::default()`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends CountSummary with media_breakdown, lang_breakdown, stale_doc_count
fields populated via stats_ext::breakdowns(). Adds count_summary_with_threshold
for callers that need real stale counts. Mirrors all new fields onto the
wire-bound Stats struct in kebab-app::schema with #[serde(default)] for
backwards-compat. Also fixes search_budget_integration.rs for the trace field
added to SearchOpts in Task 1.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- wire schema: relax effective_end.minimum 1 → 0 + expand
description to cover line-clamp + out-of-range sentinel
(panic-fix R1 emits Some(0) when line_start=1 and range is
beyond doc end — schema must accept it)
- tests: tighten first-chunk-target boundary test to assert ≤ 2
total neighbors (3-chunk doc, N=2). Strict "first chunk →
context_before empty" not assertable until chunks.ordinal
column lands (R1 #9 architectural caveat)
- store: trim contradiction in list_chunk_ids_for_doc warning
comment — drop "good enough for sequentially chunked
markdown" phrase that conflicts with "hash sort dominates"
paragraph above
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Line-based slice over fmt_canonical_to_markdown output.
PDF / audio source_type → span_not_supported StructuredError.
Out-of-range line_end clamps to total; effective_end reflects
post-budget trim. invalid_input on zero / inverted bounds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks CanonicalDocument blocks, serializes to markdown, applies
chars/4 budget when opts.max_tokens is set. doc_not_found
preserved through StructuredError.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Chunk mode + +-N context. doc / span modes return placeholder
errors (filled by subsequent tasks). fmt_canonical_to_markdown
helper introduced now since doc mode (Task 4) consumes it.
Errors are typed StructuredError so classify preserves
chunk_not_found / doc_not_found through the wire layer.
Adds SqliteStore::list_chunk_ids_for_doc so the facade can derive
+-N neighbors without leaking direct rusqlite usage into kebab-app.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous round-1 fix dropped the speculative cursor branch on
the truncated path, leaving a contradiction with the docs:
- snippet-only shrunk → cursor emitted (returned == k_effective)
- k-popped → cursor null (returned < k_effective)
But docs promised the opposite.
R2 resolution: emit cursor whenever more hits may be reachable
(either retriever filled the page OR budget popped hits — the
popped ones remain fetchable from offset+returned). Drop the
artificial "widen vs paginate" copy; truncated and next_cursor
are now independent signals — caller may do either or both.
Updates: app.rs::search_with_opts logic + SearchResponse doc +
schema description + SKILL.md two bullets + max_tokens=0 test
asserts cursor IS emitted on k-pop case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- error_wire: StructuredError wrapper preserves ErrorV1 through
anyhow → classify pipeline. Adds downcast short-circuit so
cursor::decode's typed code = "stale_cursor" reaches the wire
instead of being string-formatted to code = "generic".
- app: search_with_opts now wraps cursor::decode error in
StructuredError instead of anyhow! string format.
- test: error_wire pins both negative (bare anyhow → not
stale_cursor) AND positive (StructuredError → stale_cursor)
invariants. CLI integration test runs end-to-end and asserts
error.v1.code on stderr.
- app: next_cursor only emitted on full-page (k-pop) path; drop
speculative emit on snippet-only truncation that would point at
a different page than the agent expected.
- cursor: differentiate malformed-base64 / malformed-payload /
revision-mismatch error messages; all keep code = stale_cursor.
- test: cursor_rejected fixture uses .expect() to fail loud on
cursor non-emission instead of silent skip.
- test: max_tokens=0 → 1-hit floor + truncated=true.
- docs: SKILL.md + schema description distinguish snippet-shrink
(widen) vs k-pop (paginate) truncated cases. HOTFIXES notes
--no-cache semantic shift (cached path + clear vs uncached path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ErrorV1 is the workspace wire error struct; boxing here would
force every call site to deref through a Box for no win — the
err-path is rare. Single allow at the function level.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JSON output wrapped in search_response.v1 (breaking — agent must
adapt). Plain output unchanged + [truncated; use --cursor X]
stderr hint when budget tripped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Budget loop: snippet shorten → k pop → ≥1 hit floor. Cursor
encode/decode threads corpus_revision; mismatch surfaces as
stale_cursor anyhow error. App::search retained as thin wrapper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Opaque base64(JSON{offset, corpus_revision}). Mismatch or
malformed input returns ErrorV1 with code = stale_cursor.
base64 promoted to workspace dep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker channel now carries kebab_app::StreamEvent. drain_stream
matches on Token { delta }; RetrievalDone and Final are ignored
(citations render from last_answer, Final is redundant with
worker join). app::AskState.rx type widened to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- kebab-app::staleness::compute_stale gains note pointing at the
kebab-rag mirror so future modifiers know to update both copies.
- kebab-rag::pipeline: doc comments adjacent to compute_stale and
embedding_ref_for were positioned such that rustdoc would
misattribute them. Reorder/separate so each comment hugs its
own function.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
compute_stale: strict > boundary, threshold=0 disables, future
timestamps treated as fresh (clock skew safety). App::search
re-stamps on cache hit so config threshold changes take effect
without flushing the cache.
Also unblocks the workspace build by plugging placeholder
indexed_at/stale into the two AnswerCitation construction
sites in kebab-rag/pipeline.rs (the score-gate refusal path
forwards from SearchHit; the LLM-citation path uses
UNIX_EPOCH/false until Task 7 wires the real values through
pack_context).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ingest_file_with_config: lowercase normalize ext (caller-side) +
early error on unsupported extension (`.docx` etc. now Err with
helpful message instead of silent skipped_by_extension counter).
New test ingest_file_errors_on_unsupported_extension.
- ingest_stdin_with_config: doc comment explaining intentional
double-call of ensure helpers (idempotent + ~ms negligible).
- external::inject_frontmatter: simplify precheck via single
trim_start binding + add CR-only line ending edge case.
- external::inject_frontmatter: doc note on yaml_quote escape
contract (agent-supplied titles with special chars are safe).
Round 1 review summary: #111 (comment)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- kebab-app: add #[doc(hidden)] to ingest_stdin_with_config (CLAUDE.md
convention — all *_with_config functions should have this attribute;
fb-31's first impl missed it on the second facade fn).
- SKILL.md: "Since v0.4.0" → "Since v0.3.1" (MCP shipped in fb-30
release v0.3.1; the wrong version claim was introduced in fb-30 doc
sync and carried forward into fb-31).
- tools_call_ingest_file: add idempotency test (second call with same
content → unchanged=1, new=0). Spec called for two tests; first impl
shipped only the happy path.
Version bump 0.3.1 → 0.3.2 deferred to separate `chore/bump-v0.3.2` PR
mirroring fb-27 + fb-30 precedent (commits 73f5d73 / 5495d96).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps body with YAML frontmatter (title + source_uri) via
crate::external::inject_frontmatter, writes to
_external/<hash12>.md, delegates to ingest_file_with_config. Markdown
only in v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three scenarios — copies external md + reports new=1, idempotent on
second call (unchanged=1), errors on missing path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-file ingest entry. Copies bytes to _external/<hash12>.<ext>
via crate::external::copy_to_external, runs the per-medium pipeline on
that single asset (reuses ingest_with_config_opts via a SourceScope
{ root: _external/, include: [<filename>], exclude:
config.workspace.exclude }).
`.kebabignore` matches log a stderr warn line and proceed (explicit
ingest is bypass intent). Internal helper `check_kebabignore_match`
uses the `ignore` crate's GitignoreBuilder.
Returns the standard IngestReport (incremental ingest from fb-23
handles re-ingest as `unchanged`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds integration test schema_tool_emits_error_v1_when_db_missing that
verifies NotIndexed errors are emitted as error.v1 JSON with isError=true.
Also fixes ErrorV1 struct to include required schema_version field per
error.v1 wire contract (docs/wire-schema/v1/error.schema.json).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fb-30 의 새 crate `kebab-mcp` 가 동일 classify 모듈 사용 — UI crate 끼리
import 는 facade rule 위반이므로 kebab-app 으로 promotion. fb-27 commit
c91228e 의 코드 그대로 이전 (struct + classify + classify_llm + 7 unit
test). reqwest dev-dep 도 함께 이동.
kebab-cli 는 `kebab_app::ErrorV1` / `kebab_app::classify` 로 import 경로
1줄 변경 + wire.rs 의 `&crate::error_classify::ErrorV1` 1줄 교체. 동작
무영향.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- schema.rs: extract `SCHEMA_V1_ID` const + re-export via kebab-app::lib.rs.
wire.rs::wire_schema 의 2 literal 도 import 해서 single source of truth.
- schema.rs::collect_models: parser_version 가 markdown 만 surface 함을
주석으로 명시 (PDF/image extractor 의 자체 version 은 SchemaV1.models 가
multi-medium map 으로 진화 시 surface).
- main.rs::print_schema_text: 헤더 줄 끝의 `\n` 제거 + `println!()` 추가 —
다른 section 들과 패턴 일관.
- error_classify.rs::llm_unreachable_classifies: timeout 50ms → 500ms (10x
headroom) + 접근 방식 + 한계 주석 추가.
- HOTFIXES: open_existing 의 RW flag + 주석-only enforcement 갭을
Known-limitation 에 명시.
Round 1 review summary: #104 (comment)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plug coverage hole flagged in code review — test 1 was asserting only
doc_count + last_ingest_at, leaving count_summary's chunk_count and
asset_count queries un-pinned.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two scenarios: freshly-ingested 2-doc KB (stats reflect counts +
last_ingest_at populated) and empty-but-initialized KB (counts zero,
last_ingest_at None). The empty case runs ingest_with_config over an
empty workspace dir to seed kebab.sqlite before calling schema_with_config,
since open_existing (used internally) returns NotIndexed if the DB is absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace kebab-app's private `KEBAB_PARSE_MD_VERSION` literal with a
direct reference to `kebab_parse_md::PARSER_VERSION` so the parser
version cascade has a single source of truth (design §9 invariant).
Add maintenance comment on schema.rs WIRE_SCHEMAS const pointing to
docs/wire-schema/v1/ + kebab-cli wire helpers as the authoritative
sources to keep in sync.
Tighten open_existing doc comment to match the actual SQLITE_OPEN_READ_WRITE
flag (needed for WAL pragma application) — callers should still avoid
issuing mutations through this connection.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace `path.exists()` + `Connection::open` (which silently CREATEs on
race) with `Connection::open_with_flags` using READ_WRITE|URI but NOT
CREATE. SQLite surfaces `SQLITE_CANTOPEN` for missing files; we wrap as
NotIndexed { found: None } as before.
Adds open_existing_does_not_create_missing_db regression test pinning
the no-side-effect invariant.
Also documents read-only intent on open_existing, the format contract
on NotIndexed.found, and removes scaffolding comments from kebab-app
error_signal that are no longer load-bearing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New `SqliteStore::open_existing` API + `NotIndexed` signal for the
missing-DB case. kebab-app re-exports the type via its `error_signal`
module so kebab-cli's `error_classify` can map it to
`error.v1 { code: "not_indexed" }`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps every error path in `Config::from_file` (read failure, TOML parse,
validation) so downstream callers can `downcast_ref::<ConfigInvalid>()`
to build the `error.v1` wire record. kebab-app re-exports the type via
its `error_signal` module.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>