Commit Graph

100 Commits

Author SHA1 Message Date
c19aa006d0 feat(p10-1c-go): activate Go in ingest_one_code_asset dispatch
Replaces Go bail! arms with GoAstExtractor + CodeGoAstV1Chunker. Adds
go_file_ingests_and_searches_as_code_citation integration test — asserts
citation.lang=go, symbol=chunk.ParseDoc, code_lang=go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:13:47 +00:00
2559d0d95a refactor(p10-1c-go): add go to ingest dispatch allowlist (bail until Task F)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:03:28 +00:00
453ec15df4 fix(dogfood): document-centric fetch_span + remove get_asset_by_workspace_path
assets.workspace_path is INTENTIONALLY 'last-registered path' for twin
files (identical content at different paths share one asset row PK'd by
blake3 content hash). PR #146 made try_skip_unchanged document-centric;
PR #149 made reset --orphans-only document-centric; this PR removes the
last caller of get_asset_by_workspace_path (fetch.rs:193 in fetch_span,
which used it to reject PDF/audio media — for twins this could read the
wrong asset's media_type and pick the wrong branch).

Replaced with the natural 2-step lookup: get_document_by_workspace_path
(PR #146) → doc.source_asset_id → get_asset (NEW trait method, asset_id
is PRIMARY KEY so flip-flop-immune by construction).

Then removed get_asset_by_workspace_path trait method + SqliteStore impl
— 0 callers after the refactor.

UPSERT doc-comment refreshed in store.rs to make the 'last-registered'
semantics explicit so future readers don't try to 'fix' the flip-flop.

Dogfood follow-up (PR #142 1B + multi-root corpus).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:03:38 +00:00
5f2bd9e97e feat(dogfood): kebab reset --orphans-only — purge stored docs outside walker scope
PR #148 auto-purges only filesystem-missing files (conservative — leaves
on-disk-but-out-of-scope docs alone for data safety). This is the explicit
complement: when the user has narrowed include / widened exclude / removed
a sub-directory from the workspace and WANTS the stored docs reconciled,
they invoke 'kebab reset --orphans-only'.

Confirm prompt with orphan count + sample paths; --yes required in
non-TTY. SQLite purge via existing purge_deleted_workspace_path (PR #148)
+ vector store delete_by_chunk_ids when configured. No fs existence
check — orphans-only is the explicit 'I know what I'm doing' variant.

dogfood follow-up to PR #148 (file deletion auto-purge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:38:10 +00:00
d6d165df01 docs(dogfood): sync sweep_deleted_files algorithm doc with try_exists (PR #148 nit)
Round 2 review found the function-level doc-comment still referenced the
old fs::exists() (now replaced by try_exists().unwrap_or(true) in commit
2baa846). One-line clarification — describes the conservative-on-Err
semantics so future readers don't reintroduce the data-safety bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:10:27 +00:00
2baa846c6b fix(dogfood): conservative try_exists() in sweep_deleted_files (PR #148 review)
Round 1 review found a data-safety bug: fs::exists() returns false on
errors like EACCES / EPERM / NFS-hiccup / ownership-change, which would
trigger purge on a file that is in fact still on disk (just unreadable
this moment). Switched to try_exists().unwrap_or(true) so transient FS
errors are CONSERVATIVELY treated as 'file present' — never purge on
uncertain signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:04:03 +00:00
27baec82ea fix(dogfood): auto-purge stored docs for filesystem-deleted files
Files deleted from disk (rm a.md) were leaving stale documents + chunks +
embeddings in the store, surfacing as ghost citations in search/ask.
Existing purge_orphan_at_workspace_path only handled content-changed
stale (WHERE workspace_path=? AND asset_id != ?) — file deletion has no
new asset_id.

Fix: post-walker-scan sweep. Compute (stored_paths - scanned_paths),
for each candidate check filesystem existence — only purge when the
file is TRULY missing. Scope-narrowing case (file on disk but outside
include glob) is explicitly NOT purged to protect users from accidental
data loss via config edits.

Adds:
- DocumentStore::all_workspace_paths trait method + SqliteStore impl
- purge_deleted_workspace_path in store-sqlite (returns chunk_ids for
  vector delete; deletes doc CASCADE + asset row + copied storage file)
- sweep_deleted_files in kebab-app::ingest path; called once per ingest
  before the per-asset loop
- IngestReport.purged_deleted_files counter (additive, serde default)
- CLI ingest summary mentions purge count when > 0
- 2 integration tests: file_deletion_auto_purge + include_scope_narrowing_does_NOT_purge

dogfood discovery (PR #142 1B + multi-root: kebab-docs + httpx + zod
+ lodash). Per user decision: only filesystem deletion auto-purges;
scope narrowing requires explicit kebab reset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 06:51:07 +00:00
360f825f3a docs(dogfood): refresh try_skip_unchanged doc-comment to match new flow (PR #146 review)
Round 1 review found the function-level doc-comment still described the
old asset-side algorithm (item 2 asset-row checksum, item 3 id_for_doc
miss). Updated to the document-centric flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 05:35:17 +00:00
641b92af7d fix(dogfood): document-centric try_skip_unchanged for twin-file idempotency
Identical-content files at different workspace paths share one assets row
(assets.asset_id = blake3 content hash, PRIMARY KEY). The UPSERT
`ON CONFLICT(asset_id) DO UPDATE SET workspace_path = excluded` made
twin files overwrite each other's workspace_path on every ingest, so
`get_asset_by_workspace_path(path1)` returned the OTHER twin's row (or
None) — break idempotent unchanged-detection for both files.

Fix: switch try_skip_unchanged to document-centric lookup. `documents.
workspace_path` is already UNIQUE (V001) and `id_for_doc(path, ...)`
includes path, so each twin has its own stable document row. Compare
`doc.source_asset_id` with the new asset's checksum instead of going
through the assets table.

Dogfood (multi-root: kebab-docs + httpx + zod + lodash) showed 27 of
726 docs marked Updated on every idempotent re-ingest — all 27 are
twin-file victims (empty `__init__.py` ×3, AGENTS.md ↔ CLAUDE.md
same content, duplicate logo PDFs/JPGs).

After: re-ingest reports 0 new / 0 updated / 726 unchanged.

No schema migration needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 05:27:21 +00:00
4e8b84c4e0 fix(dogfood): populate schema.v1.repo_breakdown (Task 9 follow-up)
Dogfooding (PR #142 1B + multi-root corpus: kebab-docs + httpx + zod + lodash)
revealed schema.v1.repo_breakdown is always {} despite the 1A-2 Task 9
having added the code_lang_breakdown sibling. The schema.rs:171 placeholder
`BTreeMap::new()` was left in place. Mirror Task 9's code_lang_breakdown
query for the repo field — same metadata_json JSON-path pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 05:09:19 +00:00
d53995a6d4 feat(p10-1b): code-js-ast-v1 chunker + activate JavaScript in app dispatch
Chunker: duplicate-with-substitution from code-ts-ast-v1 / code-rust-ast-v1.
Dispatch: replaces JS bail! arms with JavascriptAstExtractor + CodeJsAstV1Chunker.
Integration test javascript_file_ingests_and_searches_as_code_citation asserts
citation.lang=javascript, symbol=src/Bar.Bar.baz, code_lang=javascript.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:16:07 +00:00
31245a4328 fix(p10-1b): TS parser_version code-typescript-v1 → code-ts-v1 (naming consistency)
Task H implementer chose code-typescript-v1 but plan + design §3.3 use the
short form (chunker is code-ts-ast-v1 / code-js-ast-v1). Aligning parser
versions to match: rust=code-rust-v1 / python=code-python-v1 / ts=code-ts-v1
/ js=code-js-v1 (Task K). Fixes 2 sites: const PARSER_VERSION + integration
test assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:05:17 +00:00
acb61b6830 feat(p10-1b): activate TypeScript in ingest_one_code_asset dispatch
Replaces TS bail! arms with TypescriptAstExtractor + CodeTsAstV1Chunker.
Adds typescript_file_ingests_and_searches_as_code_citation integration test —
asserts citation.lang=typescript, symbol=src/Foo.Foo.bar, code_lang=typescript.
JS arms remain bail!() (Task L).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:59:41 +00:00
1815091247 feat(p10-1b): activate Python in ingest_one_code_asset dispatch
Replaces Python bail! arms with PythonAstExtractor + CodePythonAstV1Chunker.
Adds python_file_ingests_and_searches_as_code_citation integration test —
asserts citation.lang=python, symbol=kebab_eval.metrics.compute_mrr,
code_lang=python. TS/JS arms remain bail!() (Tasks J/L).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:49:01 +00:00
8bdb3e8090 refactor(p10-1b): generalize ingest_one_code_asset for multi-language dispatch
Rust path observably unchanged (verified by existing code_ingest_smoke tests).
Python/TS/JS arms bail with TODO; per-lang extractor + chunker land in subsequent tasks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:35:53 +00:00
b1d5047399 fix(p10-1a-2): PR review round 1 — doc inconsistencies + observable backfill error path
PR #140 회차 1 actionable 7건 반영:
- docs/SMOKE.md: parser_version "code-rust-ast-v1" → "code-rust-v1" (chunker_version 과 혼동); jq path .citation.code_lang → .citation.lang (wire 의 code_lang 은 SearchHit top-level)
- docs/ARCHITECTURE.md: Mermaid pcode→ptypes 잘못된 edge → pcode→core 로 정정 (kebab-parse-code Cargo.toml 실제 dep 와 일치); 디렉토리 트리에서 code-rust-ast-v1 chunker 표기 위치 kebab-parse-code → kebab-chunk 로 정정
- crates/kebab-app/src/app.rs: backfill_repo 의 .ok().flatten() 실패 silent swallow → tracing::warn 로 관측 가능, 비-abort 의도 보존
- crates/kebab-parse-code/src/rust.rs: impl_item arm 의 "function_item 만 unit 생성" 1A scope 한정 주석을 외부에서도 보이도록 arm 상단에 한 줄 추가 (내부 주석은 유지)

verify: kebab-parse-code 7/7 / kebab-app --lib 51/51 / code_ingest_smoke 3/3 green; touched-crate clippy clean (재부팅 전 검증).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:24:20 +00:00
da51e59081 feat(p10-1a-2): populate schema.v1 code_lang_breakdown
Add `SqliteStore::code_lang_breakdown()` that queries
`json_extract(metadata_json, '$.code_lang')`, groups by it, and
skips NULL rows — returning `BTreeMap<String, u32>`.

Wire it into `collect_stats` in `kebab-app::schema`, replacing the
`BTreeMap::new()` placeholder inserted by 1A-1.

Test: `store::tests::code_lang_breakdown_counts_by_code_lang` asserts
rust=1 and that a null-code_lang doc does NOT appear in the map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:41:52 +00:00
11a0fc758f docs(p10-1a-2): note backfill invariant at search_with_opts non-trace path (Task 8 review)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:20:13 +00:00
b5d1fe8c1e feat(p10-1a-2): backfill SearchHit.repo from doc metadata (Task 8b)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:13:01 +00:00
580576c2c6 feat(p10-1a-2): wire code ingest dispatch (ingest_one_code_asset)
Add `MediaType::Code("rust")` dispatch arm in `ingest_one_asset`,
`ingest_one_code_asset` fn (faithful mirror of `ingest_one_pdf_asset`),
and `backfill_code_lang` post-processing in `App::search_uncached`.
Integration test `code_ingest_smoke.rs` verifies full pipeline:
ingest `.rs` → Citation::Code hit with lang/symbol/line_start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 20:14:59 +00:00
7a6a24ad10 feat(p10-1a-2): add MediaType::Code(lang) variant
TDD: red → green cycle confirmed. New `Code(String)` variant serializes
as `{"code":"rust"}` via serde `rename_all = "lowercase"`. All exhaustive
`match` sites updated (`media_label`, `ingest_one_asset` catch-all →
explicit or-pattern). Design §3.5 enum listing synced. Also fix
`/target` symlink gitignore pattern so integration-test binary lookup
via workspace-relative path works with CARGO_TARGET_DIR redirect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:14:45 +00:00
th-kim0823
298f4adc81 feat(p10-1a-1): CLI filter flags + SchemaStats breakdowns + regression tests
Task 13: add wire regression tests proving markdown SearchHit omits
repo/code_lang when None, and all 5 original Citation variants serialize
byte-identically without spurious Code-variant keys.

Task 15: add --repo (repeatable) and --code-lang (repeatable,
comma-separated) flags to `kebab search`; propagate both into
SearchFilters instead of the previous vec![] stub. Add
#[allow(clippy::large_enum_variant)] — Cmd is short-lived, boxing buys
nothing.

Task 16: add code_lang_breakdown and repo_breakdown BTreeMap fields to
Stats (schema.v1); derive Default on Stats; populate both as empty in
collect_stats (1A-2 fills them when code chunks land). Add unit test
asserting both keys are present in the serialized object.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 17:21:59 +09:00
th-kim0823
4e8b70a04b feat(p10-1a-1): apply generated-header + size-cap skip per file
Wire kebab_parse_code::is_generated_file and is_oversized into
FsSourceConnector::scan_with_skips. Files that pass gitignore/builtin/
kebabignore matching are now checked for generated-file markers
(config-gated via ingest.code.skip_generated_header) and byte/line caps
(ingest.code.max_file_bytes / max_file_lines). FsScanSkips gains
skipped_generated + skipped_size_exceeded counters; kebab-app threads
them into IngestReport. Also fixes a pre-existing clippy::derivable_impls
warning in IngestCfg. Three new connector tests cover all three paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 17:06:59 +09:00
th-kim0823
9fce24b106 feat(p10-1a-1): wire IngestReport skip counters by category (gitignore/builtin/kebabignore)
Refactor walker to expose WalkOverrides (combined + per-source matchers),
add walk_files_with_skips that returns accepted files alongside skip
attribution, wire FsSourceConnector::scan_with_skips into kebab-app so
IngestReport.skipped_gitignore, skipped_kebabignore, skipped_builtin_blacklist,
and skip_examples are populated instead of left at zero. Priority order
per spec §5.2 (builtin > gitignore > kebabignore) enforced in classify_skip,
with a directory-aware builtin matcher so pruned directory entries are
correctly attributed to builtin rather than a coincident gitignore entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 16:42:28 +09:00
th-kim0823
351c7a0826 feat(p10-1a-1): add IngestReport skip counters + SkipExamples
Adds five new u32 counters (skipped_gitignore, skipped_kebabignore,
skipped_builtin_blacklist, skipped_generated, skipped_size_exceeded)
and a SkipExamples struct (≤5 sample paths per category) to
IngestReport. All new fields are #[serde(default)] for backward-compat
deserialization. Downstream literal construction sites patched with
zeros/empty; snapshot re-baked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 15:28:19 +09:00
th-kim0823
fa4eeb5a87 feat(p10-1a-1): add SearchHit.repo / code_lang + SearchFilters.repo / code_lang
Wire two new optional fields onto SearchHit (skip_serializing_if = None)
and two Vec<String> filter fields onto SearchFilters (serde default).
Add RetrievalDetail::Default impl (manual, uses SearchMode::Hybrid as
sentinel). Patch all downstream SearchHit / SearchFilters literal
constructors with repo: None / code_lang: None / vec![] as appropriate.
Also covers Citation::Code arm in kebab-eval metrics match.
2026-05-15 15:04:23 +09:00
th-kim0823
441f1192ee docs(fb-42): wire schema + README + SMOKE + design + SKILL + INDEX
- Add bulk_search_item.v1 + bulk_search_response.v1 wire schemas
- Register both in WIRE_SCHEMAS const
- README: --bulk flag mention + MCP tool list 7→8 (bulk_search)
- SMOKE: bulk multi-query walkthrough (CLI + MCP equivalent)
- Design §2.2: Bulk multi-query (fb-42) subsection (additive minor)
- SKILL: mcp__kebab__bulk_search section + tool table row
- Task spec status open→completed, banner replaced
- INDEX: fb-42 row 머지 (rerank hint deferred)
- Fix: missed Capabilities {bulk_search} in cli wire.rs test (Task 7 leftover)
- Fix: missed tools.len() 7→8 in cli_mcp_smoke (Task 5 leftover)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:07:36 +09:00
th-kim0823
e8da415624 feat(schema): bulk_search capability flag (fb-42)
- Capabilities.bulk_search: true (snapshot)
- schema.v1 wire required list updated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:49:09 +09:00
th-kim0823
1ebbd6b711 feat(app): bulk_search_with_config facade (fb-42) 2026-05-10 20:18:49 +09:00
th-kim0823
f7e2072d66 test(cli): integration tests for --trace + schema breakdowns (fb-37)
Also fixes App::search_with_opts trace branch to use NoopRetriever
for SearchMode::Lexical, removing the embeddings requirement when
the user only wants lexical-mode trace.
2026-05-10 13:21:33 +09:00
th-kim0823
69037c313a feat(app): SearchResponse.trace + opts.trace threading (fb-37)
Adds the `trace: Option<SearchTrace>` field to `SearchResponse` and
threads `SearchOpts.trace` through `App::search_with_opts`. When the
caller sets `opts.trace = true` the path bypasses the LRU search cache
and runs through `HybridRetriever::search_with_trace`, which dispatches
all 3 SearchModes internally; this means `--trace` requires embeddings
(same constraint as `--mode hybrid`). The non-trace path keeps its
exact prior behavior with `trace: None` stamped on the response.

Picked up Task 1 / Task 3 follow-ups in the same commit so the
workspace compiles: SearchOpts struct-literals in kebab-cli/main.rs +
kebab-mcp/tools/search.rs default the new `trace` field to false, and
the schema-wrapper test in kebab-cli/wire.rs fills the new
media_breakdown / lang_breakdown / index_bytes / stale_doc_count fields
on Stats with `Default::default()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 13:01:18 +09:00
th-kim0823
231d80e82d feat(stats): media/lang/bytes/stale fields on schema.v1.stats (fb-37)
Extends CountSummary with media_breakdown, lang_breakdown, stale_doc_count
fields populated via stats_ext::breakdowns(). Adds count_summary_with_threshold
for callers that need real stale counts. Mirrors all new fields onto the
wire-bound Stats struct in kebab-app::schema with #[serde(default)] for
backwards-compat. Also fixes search_budget_integration.rs for the trace field
added to SearchOpts in Task 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 12:34:57 +09:00
th-kim0823
b86b763dfb fix(fb-35): address PR #126 round 2 review
- wire schema: relax effective_end.minimum 1 → 0 + expand
  description to cover line-clamp + out-of-range sentinel
  (panic-fix R1 emits Some(0) when line_start=1 and range is
  beyond doc end — schema must accept it)
- tests: tighten first-chunk-target boundary test to assert ≤ 2
  total neighbors (3-chunk doc, N=2). Strict "first chunk →
  context_before empty" not assertable until chunks.ordinal
  column lands (R1 #9 architectural caveat)
- store: trim contradiction in list_chunk_ids_for_doc warning
  comment — drop "good enough for sequentially chunked
  markdown" phrase that conflicts with "hash sort dominates"
  paragraph above

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:55:29 +09:00
th-kim0823
7dddc1d706 fix(fb-35): address PR #126 round 1 review
- fetch_span: panic-fix on line_start > total / empty doc
  (return empty text + effective_end = line_start - 1 instead of
  out-of-bounds slice)
- truncated: reserved for budget-driven truncation only; line
  range clamp signaled via effective_end < line_end
- spec / SKILL.md / README: align rejection wording to "PDF /
  audio" (matches code; Image OCR allowed for span)
- store: warning comment on list_chunk_ids_for_doc — chunk_id
  hash sort does NOT preserve document position; real fix is a
  chunks.ordinal column, tracked as follow-up
- surrounding_chunks: saturating_add to defend against u32::MAX
  context arg on 32-bit targets
- tests: line_start > total returns empty + chunk context at
  doc boundary clamps lower bound

Deferred nits (follow-up): table-separator strict CommonMark form;
MCP per-mode strict validation; CLI chunk_id truncation in plain
output. None block correctness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:45:29 +09:00
th-kim0823
1b9d89eb3a feat(app): App::fetch span mode + PDF/audio rejection (fb-35)
Line-based slice over fmt_canonical_to_markdown output.
PDF / audio source_type → span_not_supported StructuredError.
Out-of-range line_end clamps to total; effective_end reflects
post-budget trim. invalid_input on zero / inverted bounds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:54:22 +09:00
th-kim0823
7d1f855f7e feat(app): App::fetch doc mode with budget (fb-35)
Walks CanonicalDocument blocks, serializes to markdown, applies
chars/4 budget when opts.max_tokens is set. doc_not_found
preserved through StructuredError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:48:40 +09:00
th-kim0823
610d29f053 feat(app): App::fetch chunk mode + markdown serializer (fb-35)
Chunk mode + +-N context. doc / span modes return placeholder
errors (filled by subsequent tasks). fmt_canonical_to_markdown
helper introduced now since doc mode (Task 4) consumes it.
Errors are typed StructuredError so classify preserves
chunk_not_found / doc_not_found through the wire layer.

Adds SqliteStore::list_chunk_ids_for_doc so the facade can derive
+-N neighbors without leaking direct rusqlite usage into kebab-app.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:44:51 +09:00
th-kim0823
e084b306e5 fix(fb-34): align next_cursor semantics with docs (PR #125 round 2)
Previous round-1 fix dropped the speculative cursor branch on
the truncated path, leaving a contradiction with the docs:
- snippet-only shrunk → cursor emitted (returned == k_effective)
- k-popped → cursor null (returned < k_effective)
But docs promised the opposite.

R2 resolution: emit cursor whenever more hits may be reachable
(either retriever filled the page OR budget popped hits — the
popped ones remain fetchable from offset+returned). Drop the
artificial "widen vs paginate" copy; truncated and next_cursor
are now independent signals — caller may do either or both.

Updates: app.rs::search_with_opts logic + SearchResponse doc +
schema description + SKILL.md two bullets + max_tokens=0 test
asserts cursor IS emitted on k-pop case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 21:07:04 +09:00
th-kim0823
f485608108 fix(fb-34): address PR #125 round 1 review
- error_wire: StructuredError wrapper preserves ErrorV1 through
  anyhow → classify pipeline. Adds downcast short-circuit so
  cursor::decode's typed code = "stale_cursor" reaches the wire
  instead of being string-formatted to code = "generic".
- app: search_with_opts now wraps cursor::decode error in
  StructuredError instead of anyhow! string format.
- test: error_wire pins both negative (bare anyhow → not
  stale_cursor) AND positive (StructuredError → stale_cursor)
  invariants. CLI integration test runs end-to-end and asserts
  error.v1.code on stderr.
- app: next_cursor only emitted on full-page (k-pop) path; drop
  speculative emit on snippet-only truncation that would point at
  a different page than the agent expected.
- cursor: differentiate malformed-base64 / malformed-payload /
  revision-mismatch error messages; all keep code = stale_cursor.
- test: cursor_rejected fixture uses .expect() to fail loud on
  cursor non-emission instead of silent skip.
- test: max_tokens=0 → 1-hit floor + truncated=true.
- docs: SKILL.md + schema description distinguish snippet-shrink
  (widen) vs k-pop (paginate) truncated cases. HOTFIXES notes
  --no-cache semantic shift (cached path + clear vs uncached path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 20:49:27 +09:00
th-kim0823
e1fcea6313 chore: clippy fix for fb-34 — allow result_large_err on cursor::decode
ErrorV1 is the workspace wire error struct; boxing here would
force every call site to deref through a Box for no win — the
err-path is rare. Single allow at the function level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 20:20:36 +09:00
th-kim0823
21220f6d39 feat(cli): kebab search --max-tokens / --snippet-chars / --cursor (fb-34)
JSON output wrapped in search_response.v1 (breaking — agent must
adapt). Plain output unchanged + [truncated; use --cursor X]
stderr hint when budget tripped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 20:02:50 +09:00
th-kim0823
af80cedd81 feat(app): App::search_with_opts + SearchResponse (fb-34)
Budget loop: snippet shorten → k pop → ≥1 hit floor. Cursor
encode/decode threads corpus_revision; mismatch surfaces as
stale_cursor anyhow error. App::search retained as thin wrapper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:59:48 +09:00
th-kim0823
aabe66f5e2 docs(error_wire): note stale_cursor convention (fb-34)
stale_cursor is built by cursor::decode, not classify. Test
locks the invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:50:39 +09:00
th-kim0823
ebbc3a46ae feat(app): cursor encode/decode for paginated search (fb-34)
Opaque base64(JSON{offset, corpus_revision}). Mismatch or
malformed input returns ErrorV1 with code = stale_cursor.
base64 promoted to workspace dep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:49:23 +09:00
th-kim0823
e5c99f5b80 feat(tui): adapt ask worker to StreamEvent sink (fb-33)
Worker channel now carries kebab_app::StreamEvent. drain_stream
matches on Token { delta }; RetrievalDone and Final are ignored
(citations render from last_answer, Final is redundant with
worker join). app::AskState.rx type widened to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 14:57:46 +09:00
th-kim0823
16db60f7bd docs(fb-32): mirror back-reference + fix pipeline doc-comment ordering
- kebab-app::staleness::compute_stale gains note pointing at the
  kebab-rag mirror so future modifiers know to update both copies.
- kebab-rag::pipeline: doc comments adjacent to compute_stale and
  embedding_ref_for were positioned such that rustdoc would
  misattribute them. Reorder/separate so each comment hugs its
  own function.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:51:40 +09:00
th-kim0823
dfef65f196 feat(app): staleness module + post-process search hits (fb-32)
compute_stale: strict > boundary, threshold=0 disables, future
timestamps treated as fresh (clock skew safety). App::search
re-stamps on cache hit so config threshold changes take effect
without flushing the cache.

Also unblocks the workspace build by plugging placeholder
indexed_at/stale into the two AnswerCitation construction
sites in kebab-rag/pipeline.rs (the score-gate refusal path
forwards from SearchHit; the LLM-citation path uses
UNIX_EPOCH/false until Task 7 wires the real values through
pack_context).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:30:10 +09:00
th-kim0823
7f5739d8fb 🏗️ refactor(fb-31): apply round 1 review nits
- ingest_file_with_config: lowercase normalize ext (caller-side) +
  early error on unsupported extension (`.docx` etc. now Err with
  helpful message instead of silent skipped_by_extension counter).
  New test ingest_file_errors_on_unsupported_extension.
- ingest_stdin_with_config: doc comment explaining intentional
  double-call of ensure helpers (idempotent + ~ms negligible).
- external::inject_frontmatter: simplify precheck via single
  trim_start binding + add CR-only line ending edge case.
- external::inject_frontmatter: doc note on yaml_quote escape
  contract (agent-supplied titles with special chars are safe).

Round 1 review summary: #111 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:46:55 +09:00
th-kim0823
dc24cb34b1 🚑 fix(fb-31): apply final review nits
- kebab-app: add #[doc(hidden)] to ingest_stdin_with_config (CLAUDE.md
  convention — all *_with_config functions should have this attribute;
  fb-31's first impl missed it on the second facade fn).
- SKILL.md: "Since v0.4.0" → "Since v0.3.1" (MCP shipped in fb-30
  release v0.3.1; the wrong version claim was introduced in fb-30 doc
  sync and carried forward into fb-31).
- tools_call_ingest_file: add idempotency test (second call with same
  content → unchanged=1, new=0). Spec called for two tests; first impl
  shipped only the happy path.

Version bump 0.3.1 → 0.3.2 deferred to separate `chore/bump-v0.3.2` PR
mirroring fb-27 + fb-30 precedent (commits 73f5d73 / 5495d96).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:38:19 +09:00
th-kim0823
a42f907640 🧪 test(kebab-app): ingest_stdin_with_config integration (fb-31)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:04:52 +09:00