Commit Graph

75 Commits

Author SHA1 Message Date
35c987df1c feat(app): log retention — keep_recent_runs + retention_days (Enhancement 4)
LoggingCfg gains two fields with serde defaults: keep_recent_runs
(default 100, top-N file retention) and retention_days (default 30,
time-based retention for both ndjson files and the SQLite mirror).

IngestLogWriter::open now runs cleanup_old_logs before creating a new
ingest-*.ndjson — delete iff (idx >= keep_recent) OR (modified <=
cutoff). ingest_with_config_opts also calls
SqliteStore::prune_pdf_ocr_events(retention_days) at ingest start so
the SQLite mirror tracks the same retention window.

Backward compat (AC-9): both new fields use #[serde(default = ...)],
so a pre-v0.20.x config with only [logging] ingest_log_enabled +
ingest_log_dir parses unchanged. kebab init writes the new defaults
automatically via Config::default() -> toml::to_string_pretty (AC-12).

docs/SMOKE.md config example synced.

Closure r1 F5: explicit OR-on-stale comment inside cleanup_old_logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:17:47 +00:00
d9ec7b8dc3 feat(cli): kebab inspect ocr-stats + ocr-failures (Enhancement 3 + wire schema additive minor)
Two new wire schemas land as additive minor: ocr_stats.v1 (corpus-wide
aggregate — total_events, success_rate, p50/p90/p99/max_ms, by_engine,
top-10 by_doc by failure count) and ocr_failures.v1 (per-doc or
corpus-wide recent failures, with --doc-id + --limit). Both ship via
new CLI subcommands `kebab inspect ocr-stats` / `inspect ocr-failures`.

App gains four facade methods: inspect_ocr_stats /
inspect_ocr_failures plus their *_with_config companions — required by
CLAUDE.md "the facade rule" so `--config <path>` is honored. The CLI
dispatch arms thread cfg explicitly into the _with_config form.

Runtime introspection emit (WIRE_SCHEMAS in schema.rs) gains two
entries; the meta JSON Schema (schema.schema.json) is untouched
because its wire.schemas is pattern-based, not enum-based.

ingest_log::percentiles extended to (p50, p90, p99, max). p99 surfaces
only via inspect ocr-stats; IngestSummary (round 1) stays 3-percentile.

SKILL.md synced with the two new schemas (AC-13).

Closure r2 G2 (facade *_with_config pair) + G3 (runtime emit, not
meta schema file) + closure r1 F4 (p99) resolved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:13:08 +00:00
4e451c9f7c feat(app): dual-write PDF OCR events to SQLite + ndjson (Enhancement 2 wiring)
Pre-capture canonical.doc_id and Arc<SqliteStore> before the OCR
emit_progress closure so both the ndjson file and the SQLite mirror
carry the same doc_id for every event. File write is durable
(errors propagate); SQLite insert is non-critical (tracing::warn on
failure, ingest does not abort) per spec R-1.

LogEvent::Ocr gains a doc_id: Option<&str> field as an additive
Serde change — round 1 ndjson logs deserialize with doc_id=None.

Closure r1 F1: doc_id NULL in dual-write resolved via
let doc_id_for_log = canonical.doc_id.0.clone() pre-capture.
Closure r2 G1: Arc::clone(&app.sqlite) reused instead of opening a
second SqliteStore — eliminates double-open lock contention and
duplicate migration runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:06:03 +00:00
685007789a style: cargo fmt --all (round 4 ingest log feature follow-up)
Phase C4 executor 의 마지막 `fix(test): clippy + fmt fixes` commit 이
test file 부분만 fmt 적용. workspace 전체 fmt 누락 발견 → cargo fmt --all
적용. 모든 import alphabetical reorder + line wrapping 정합.

추가 untracked artifact 동시 commit:
- docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 line, ACCEPT)
- docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 line, ACCEPT)

workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:18:40 +00:00
445b096215 fix(test): clippy + fmt fixes for logging_roundtrip and ingest_log_smoke
* kebab-config/tests/logging_roundtrip.rs: r#"..."# → plain string
    (clippy::unnecessary_hashes).
  * kebab-app/tests/ingest_log_smoke.rs: |e| e.ok() → Result::ok,
    |s| s.as_u64() → Value::as_u64 (clippy::redundant_closure).
  * cargo fmt --all applied to pre-existing formatting drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:26:42 +00:00
415227bf76 test(app): ingest_log_smoke integration test (AC-9)
crates/kebab-app/tests/ingest_log_smoke.rs 신규:

  * ingest_log_smoke (AC-9): tempdir + 1 md + 1 scanned PDF →
    ingest → assert log file exists + 각 line valid JSON +
    각 kind ∈ {ocr,parse_error,skip,error,summary} + last
    line kind=summary + scanned>0.

  * ingest_log_disabled_emits_no_file (AC-6): enabled=false 일
    때 log_dir 안 ingest-*.ndjson 파일 0개 verify.

fixture: ../kebab-parse-pdf/tests/fixtures/scanned_page1.pdf
재사용 (OCR disabled — Ollama 없이 smoke 실행).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:06:43 +00:00
9b44e27dfe test(app): update schema_report assertion for streaming_ask=true (Bug #9 follow-up)
schema_report_reflects_freshly_ingested_kb 가 `!streaming_ask` 를 assert 했으나
Bug #9 fix (760eee8) 로 streaming_ask 가 true 로 정정됨. assertion 을 반전.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:58:10 +00:00
d9c7aabce1 feat(schema): add active_parsers + active_chunkers arrays to schema.v1.models (Bug #13)
이전: schema.v1.models 가 parser_version / chunker_version 단일 값만 보고 →
multi-medium corpus (md + pdf + code Rust/Python + dockerfile + k8s + manifest)
의 version cascade audit 누락 risk.

이후: additive minor — Models struct 에 active_parsers + active_chunkers Vec<String>
추가. backward compat: 기존 단일 field 보존 (markdown default), 신규 array 는
optional (#[serde(default)] + JSON schema required 미포함).

source:
- kebab_store_sqlite::fetch_distinct_parser_versions() 가
  documents.parser_version DISTINCT + ORDER BY 반환.
- fetch_distinct_chunker_versions() 가 chunks.chunker_version 동일 pattern.
- collect_models 가 매 schema 호출마다 재계산 (cache 없음 — R-3 자동 해결).

wire schema additive only — 메이저 bump 불필요. v0.20.1 minor 로 충분.
integrations/claude-code/kebab/SKILL.md 동기 갱신.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:15:58 +00:00
241ded59df test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)
v0.20.0 sub-item 1 bugfix Step 3 (Group C) — integration-level regression
for Bug #3 (intra-doc chunk_id collision under aggressive overlap).

- `crates/kebab-app/tests/common/mod.rs`: `pub mod mock_ocr;` 1 line append.
- `crates/kebab-app/tests/common/mock_ocr.rs` (new): MockOcrEngine lift +
  `single` / `per_page` ctor (backward-compat single + per-page cursor).
- `crates/kebab-app/tests/pdf_ocr_apply.rs`: inline MockOcrEngine 제거 +
  `mod common; use common::mock_ocr::MockOcrEngine;` import. 10 ctor call
  site migration (`MockOcrEngine { .. }` → `MockOcrEngine::single(...)`).
- `crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs`
  (new): F1 + F2 scanned PDF + Bug #3 trigger shape (10 char "가" + ". " +
  500 char "나") via mock OCR. assertion: chunk_id global uniqueness (HashSet
  dedup) across F1 + F2; F2 trigger text produces ≥2 chunks (collision shape).
- C1 decision: Option A (share via tests/common/mock_ocr.rs). Facade mock
  injection unavailable (OllamaVisionOcr hardcoded) — helper-level chain test
  (apply_ocr_to_pdf_pages → PdfPageV1Chunker) adds value beyond unit B5.

spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4.5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 3)
prior: 436fd01 (Step 2 Bug #3 chunk_id fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:45:38 +00:00
436fd015a2 fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)
v0.20.0 sub-item 1 dogfood report 의 Bug #3 (Critical). scanned_page2.pdf
(1580 char OCR text) ingest 시 `chunks.chunk_id` PRIMARY KEY violation —
`per_chunk_hash = #c{char_start}` 가 post-overlap `actual_start` 사용 +
overlap walk floor 가 `prev_min` 으로 collapse → segment 1/2 동일 `#c0`.

- `crates/kebab-chunk/src/pdf_page_v1.rs`: `chunk_page` returns 4-tuple
  (segment_start, actual_start, chunk_end, slice); caller `per_chunk_hash`
  suffix uses `segment_start` (pre-overlap boundary, strictly increasing)
  instead of `char_start` (post-overlap, may collapse to prev_min).
- VERSION_LABEL `"pdf-page-v1"` → `"pdf-page-v1.1"` (design §9 cascade,
  explicit user-facing audit trail). `crates/kebab-app/tests/pdf_pipeline.rs:
  168, 368` 의 hardcoded literal 도 v1.1 로 갱신.
- module docs (`pdf_page_v1.rs:47-60`): workaround description 의
  `#c{char_start}` reference 를 `#c{segment_start}` 로 갱신 + segment_start
  invariant 명문 + HOTFIXES.md cross-ref.
- `pdf_page_v1.rs::tests`: `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids`
  regression pin (10 char "가" + ". " + 500 char "나" — multi-chunk +
  overlap walk collapse trigger).
- `tasks/HOTFIXES.md`: 2026-05-27 entry (symptom F2 1580 char OCR,
  intra-doc collision root cause, second-iteration patch rationale).

spec:  docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4)
plan:  docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 2)
prior: d9acda5 (Step 1 Bug #2 walker fix)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:32:09 +00:00
48197687b7 test(pdf): integration smoke (w/ search + cancel) + vector regression + alnum e2e (#[ignore]) for v0.20 sub-item 1
Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

I3 — crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (신규):
- ingest_with_mock_ocr_yields_pdf_ocr_summary — `#[ignore]` real Ollama,
  ingest_with_config production path + IngestItem.pdf_ocr_pages verify.
- ocr_text_indexed_and_searchable — `#[ignore]` real Ollama, app.search
  의 OCR text indexed verify (§ Acceptance #2).
- ingest_with_cancel_aborts_mid_pdf — production cancel chain (pre-set
  cancel=true + dummy endpoint, no panic/deadlock verify).

I4 — crates/kebab-parse-pdf/tests/text_extractor_regression.rs (신규):
- vector_pdf_extract_byte_identical_to_baseline — F4 mojibake.pdf 의 vector
  PDF path canonical 의 byte-identical 보존 (Step 1-8 모든 변경 전후 invariant).
- baseline 신규 = tests/snapshots/vector_pdf_canonical.json (first run create).
- normalize_provenance_timestamps inline helper (R-3 mitigation, workspace
  전체 부재 — 신규 12-line).

I5 — crates/kebab-parse-pdf/tests/ocr_e2e.rs (신규):
- f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70 — `#[ignore]` real
  Ollama qwen2.5vl:3b, § Acceptance §9 #3 의 implementation.
- alnum metric = strsim::levenshtein (dev-dep 추가).
- truth file copy from PoC scratch (page1.txt + page2-batchim.txt) →
  scanned_page1_truth.txt + scanned_page2_truth.txt.
- kebab-parse-image dev-dep 추가 (OllamaVisionOcr::from_parts 호출용).
  parser isolation invariant 의 dev-dep exception (spec §3.1, dep graph
  baseline -e normal 보존).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5)
prior: c9e0594 (Step 8 CLI printer)
contract: §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 10:10:58 +00:00
c9e05941c5 feat(cli): activate per-page PDF OCR progress printer + test(app): ingest_progress emit verify + spec(pdf-ocr): align §4.6.1 literal with option_A (ms/chars)
Step 8 (Group H) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 reviewer concern fix (spec literal deviation).

H1 — kebab-cli/src/progress.rs printer activation:
- 구 no-op stub `IngestEvent::PdfOcr* { .. } => {}` (Step 6 placeholder)
  를 사람-친화 stderr line printer 로 활성화.
- spec §4.6.1 line 1085-1086 wording 그대로:
  - PdfOcrStarted → `  📷 OCR page {page}...`
  - PdfOcrFinished (skipped=false) → `  ✓ OCR page {page} ({chars} chars, {ms}ms via {ocr_engine})`
  - PdfOcrFinished (skipped=true)  → `  ⊘ OCR page {page} skipped (no DCTDecode or engine fail, {ms}ms)` (M-4 의 skipped field carry 활용)
- `!quiet` gate 정합 (AssetStarted/Finished pattern mirror).

H2 — crates/kebab-app/tests/ingest_progress.rs 의 새 test:
- pdf_ocr_progress_emits_started_finished_events (real Ollama 의존, `#[ignore]`).
- F1 fixture (scanned_page1.pdf) ingest 시 pdf_ocr_started + pdf_ocr_finished
  event 가 emit 됨을 verify. Started count == Finished count invariant.
- Manual invoke: `KEBAB_PDF_OCR_ENABLED=true cargo test -p kebab-app --test
  ingest_progress --ignored`.
- mock OcrEngine inject path 부재 (Step 6 의 eager build), Step 9 I5 의
  ocr_e2e pattern (real Ollama + `#[ignore]`) 와 동일.

Step 7 reviewer concern fix — spec §4.6.1 literal:
- line 1076-1077 의 `ocr_ms` / `ocr_chars` literal 을 wire schema 의 실제
  field name `ms` / `chars` (option_A, Rust serde 와 정합) 로 갱신.
- line 1087 의 printer wording 도 `{ocr_chars}` / `{ocr_ms}` → `{chars}` / `{ms}`.
- line 1556 의 rationale 참조 `pdf_ocr_finished.ocr_ms` → `.ms`.
- `skipped` field 도 명시 (Step 6 reviewer M-4 결과).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.6.1)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 8 H1+H2)
prior: 4c5ccd5 (Step 7 wire schema) — Step 7 reviewer concern 1 의 fix
contract: §9 (additive minor wire bump — Step 7 commit 에서 완료)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 09:18:49 +00:00
9f003ef1cd feat(app): add pdf_ocr_apply helper (10 test, F7 split + cancel) — post-extract OCR enrichment for PDF (H-1 resolution)
Step 4 (Group D) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

D1 — `apply_ocr_to_pdf_pages(&mut canonical, &dyn OcrEngine, &bytes, &opts, emit_progress)`
in `kebab-app::pdf_ocr_apply`. spec §4.1 line 381-599 body 그대로 +
PdfOcrOpts.cancel field + per-page cancel check (verifier LOW L-1).

post-extract enrichment pattern (H-1 resolution): kebab-parse-pdf 가
kebab-parse-image::OcrEngine 을 import 하지 않음 (parser isolation 보존).
helper 가 kebab-app 의 facade 안 — both parser crate 의 cross-import 회피.

Per-page decision matrix (spec §4.1 line 459-464):
- always_on=true → 모든 page OCR (dual-block, ordinal = page-1 + page_count).
- always_on=false + needs_ocr → in-place OCR (text-detect block mutate).
- needs_ocr=false → skip.

DCTDecode-only v1 (H-3): FlateDecode / CCITTFaxDecode page 는
extract_dctdecode_page_image=None → Warning event + skip + emit_progress(skipped=true).

OcrEngine.recognize 실패 → Warning event + skip + emit_progress(skipped=true).

D3 — per-page cancel handle (verifier LOW L-1 + spec §4.8 line 1159):
PdfOcrOpts.cancel: Option<Arc<AtomicBool>>. set→true 시
`anyhow::bail!("PDF OCR cancelled mid-PDF at page N")`.

lopdf = "0.32" added to [dependencies] (already transitive via kebab-parse-pdf;
no new crate introduced — dep graph kebab-parse-* baseline unchanged).

Integration test (`tests/pdf_ocr_apply.rs`, 10 test):
- f1_input_with_ocr_enabled_replaces_empty_block — in-place mutate.
- f3_input_with_ocr_enabled_keeps_text_detect_blocks — vector PDF skip.
- f1_input_with_ocr_disabled_keeps_empty_block — disabled no-op.
- f4_input_with_ocr_enabled_replaces_mojibake_block — mojibake → in-place mutate.
- f3_input_with_always_on_pushes_dual_blocks — always_on dual-block.
- f6_flatedecode_skipped_with_warning — FlateDecode skip + Warning event.
- f7_ccittfax_skipped_with_warning — CCITTFax skip + Warning event (verifier M-4 split).
- ocr_engine_failure_surfaces_as_warning — OCR failure → Warning event.
- dual_block_ordinals_are_deterministic_and_unique — ordinal invariant.
- cancel_handle_aborts_mid_pdf — cancel handle 의 production source (D3).

MockOcrEngine fixture: spec §5.5 line 1284-1299. F3 fixture 부재 →
mock CanonicalDocument construction + F1 bytes reuse pattern (Option B:
PdfTextExtractor::extract 를 통한 실제 production path canonical 생성).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 + §5.5)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 4 D1+D2+D3)
prior: c2cd3a7 (Step 3) + 8d81bc1 (Step 3 clippy fix)
contract: §9 (additive minor wire bump — 후속 step)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 06:42:01 +00:00
7c27633df2 chore(rag): post-PR9 refactor — H1/H2/H3/D/E + test coverage + post-refactor dogfood retest
OMC team `post-pr9-refactor` 의 architectural cleanup. architect priorities 분석 후 executor + test-engineer 가 file edits, system-architect 가 component-level review 로 *pre-cut nothing — all v0.18.1+ defer* 결론.

## Executor 작업 (H1/H2/H3/D/E)

- **H1** (kebab-nli/src/onnx.rs): `[models.nli]` config wire 활성화. `DEFAULT_MODEL_ID` const 제거 (kebab-config 의 NliCfg::defaults 가 single source). OnnxNliVerifier::new 가 config.models.nli.model 읽고 config.models.nli.provider 가 "onnx" 아니면 anyhow::bail. 3 stale "PR-9c-1 will wire this" 코멘트 제거. 2 unit test 추가 (`new_uses_config_model_id`, `new_rejects_unsupported_provider`).
- **H2** (kebab-rag/src/pipeline.rs): `truncate_for_nli(premise: &str, _hypothesis: &str)` → `truncate_for_nli(premise: &str)`. v0.18.1 placeholder doc 제거. 4 callsite (tests/multi_hop.rs) 갱신 + test rename `multi_hop_truncate_for_nli_preserves_hypothesis` → `multi_hop_truncate_for_nli_char_budget` (contract 정합).
- **H3** (kebab-rag/src/pipeline.rs:1041): `was_truncated` 가 tracing::debug! 으로 surface (observability 추가, signature 보존 — caller logging contract).
- **D** (kebab-mcp/tests/tools_call_ask_multi_hop.rs): request_timeout_secs 2 → 5 (slow CI 안정성), `mh_code` discriminator 제거. dispatch contract = `mh.is_error.unwrap_or(false)` (기존 assertion 으로 충분).
- **E** (tasks/HOTFIXES.md + pipeline.rs:1633-1638): fb-41 PR-9 closure entry 의 sibling 으로 "### PR-9 NLI refusal: terminal Synthesize hop omitted from hops trace" subsection 추가. pipeline 의 "cleanup deferred to a follow-up" → "// See tasks/HOTFIXES.md ... for follow-up" cross-link.

## Test-engineer 작업 (T1/T2/T3/T4, 9 new tests)

- **T1** (kebab-nli/src/onnx.rs::tests): sanitize_model_id 3 unit (replaces_slash / idempotent / leaves_other_chars).
- **T2** (kebab-rag/tests/multi_hop_nli_panic.rs 신규): 2 panic-path tests — facade invariant (`expect("verifier must be Some when nli_threshold > 0.0")`) 의 #[should_panic] + threshold=0 의 companion.
- **T3** (kebab-rag/tests/multi_hop_nli_stream.rs 신규): 2 StreamEvent::Final tests — refuse_nli_verification + refuse_nli_model_unavailable 의 stream_sink Final 분기 wire shape pinning.
- **T4** (kebab-app/tests/open_with_config_nli.rs 신규): 2 NLI failure path — model_dir 가 unwritable 일 때 App::open_with_config 의 Result<App> Err (with "OnnxNliVerifier" in chain) + threshold=0 일 때 graceful skip.

## System-architect 결론

3 lenses (absorption / duplication / under-engineered interface) 분석 결과 — *pre-cut nothing*. Top-3 items 모두 v0.18.1+ defer:
- Lens 1: kebab-normalize + kebab-parse-types 흡수 가능 (parse-md 만 사용, 5 parsers 우회) → v0.18.1+.
- Lens 3: Extractor + Chunker trait 의 dead polymorphism (모든 callsite 가 hardcoded) → v0.18.1+.
- Lens 1 bundled: kebab-source-fs 가 kebab-parse-code 의 9 tree-sitter grammars drag → low-risk dep-graph win, v0.18.1+ bundled.
- Defer-with-intent: LanguageModel async refactor (cloud-LLM 시), NliVerifier::score_batch + typed NliError (2nd impl 시), compute_stale → kebab-core::stale.

보고서: /build/cache/tmp/post-pr9-refactor-priorities.md, /build/cache/tmp/system-architecture-priorities.md (둘 다 repo 외 — analysis 보존).

## 검증

- cargo test -p kebab-nli -j 1 → 11/11 pass.
- cargo test -p kebab-rag -j 1 → 75/75 pass (5 NLI multi-hop + 4 신규 T2/T3 포함).
- cargo test -p kebab-app -j 1 → 23 pass + 2 ignored (T4 의 2 포함).
- cargo test -p kebab-mcp --test tools_call_ask_multi_hop -j 1 → 1 pass + 1 pre-existing flaky (HOTFIX #15, no_chunks short-circuit, executor D fix 와 무관 — line 86 의 base assertion 이 fixture 없어서 fail).
- cargo clippy --workspace --all-targets -j 1 -- -D warnings clean.
- cargo test --workspace --no-fail-fast -j 1 → 1304 passed (+11 new) + 1 pre-existing flaky 동일.
- **Post-refactor dogfood retest byte-identical** (PR-9d / post-cleanup / post-refactor 3번 모두): S7 0.0035389824770390987, S1 0.058334656059741974, S10 0.0027875436935573816, S3 nli_model_unavailable.

docs/dogfood/v0.18.0/SUMMARY.md 에 "Post-architectural-refactor retest" section 추가.

Wire 영향: 없음.
Behavior 영향: 없음 (H1 의 config wiring 가 default 와 같은 model → byte-identical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 04:42:37 +00:00
7c85de065a chore: workspace-wide cleanup — clippy::pedantic baseline + auto-fix
cut PR v0.18.0 전 마지막 정리. 사용자 요청: "전체 코드베이스를 깔끔하고 알아보기 쉽게".

## Workspace lints

- `Cargo.toml` 의 `[workspace.lints.clippy]` 에 `pedantic = "warn"` (priority -1) + 의도적 allow-list 추가:
  - cast_possible_truncation / cast_possible_wrap / cast_sign_loss / cast_precision_loss — ONNX i64 / hash modular reduction 등 의도적 truncation.
  - doc_markdown / missing_errors_doc / missing_panics_doc — cosmetic doc style.
  - too_many_lines / module_name_repetitions / must_use_candidate / needless_pass_by_value / manual_let_else / items_after_statements / similar_names — informational only.
  - format_collect / match_wildcard_for_single_variants / trivially_copy_pass_by_ref / unnecessary_wraps — intentional patterns (exhaustive match, future Result variants 등).
  - default_trait_access — `Foo::default()` 가 idiomatic.
  - float_cmp — NLI / RRF score 의 explicit threshold 비교 의도.
  - struct_excessive_bools / case_sensitive_file_extension_comparisons / naive_bytecount / ignore_without_reason — domain-specific 의도.
  - format_push_string / return_self_not_must_use / match_same_arms — builder / wire-label / hot-path 패턴 보존.
  - needless_continue / used_underscore_binding / nonminimal_bool / unreadable_literal / many_single_char_names / doc_link_with_quotes / assigning_clones / collapsible_str_replace / trivial_regex / elidable_lifetime_names / range_plus_one / explicit_iter_loop / implicit_hasher / ref_option — remaining low-value style.
- 각 24 crate `Cargo.toml` 에 `[lints] workspace = true` 추가.

## Auto-fix

`cargo clippy --workspace --all-targets --fix` 적용 — 128 files changed, 552 insertions / 472 deletions. 주로:
- uninlined_format_args (~18): `format!("{}", x)` → `format!("{x}")`.
- redundant_closure_for_method_calls (~33): `.map(|x| x.foo())` → `.map(T::foo)`.
- 그 외 mechanical refactor.

## 검증

- `cargo clippy --workspace --all-targets -j 1 -- -D warnings` clean (pedantic + 모든 lint group).
- `cargo test --workspace --no-fail-fast -j 1` — **1293 tests pass + 1 pre-existing flaky fail** (`kebab-mcp::tools_call_ask_multi_hop::ask_tool_routes_multi_hop_true_to_decompose_first`, HOTFIX candidate, cleanup 무관). 회귀 0.

Wire 영향: 없음.
Behavior 영향: 없음 (mechanical refactor only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 03:01:58 +00:00
cf35f36f88 feat(rag): fb-41 PR-2 — RagPipeline::ask_multi_hop skeleton (fixed depth=2)
PR-2 of fb-41 multi-hop RAG. Decompose + retrieve + synthesize 3-stage
pipeline가 `opts.multi_hop=true` 일 때 dispatch. Dynamic decide loop
는 PR-3.

- `AskOpts.multi_hop: bool` 필드 추가 + `impl Default for AskOpts`
  도입 (HOTFIXES 2026-05-07 의 known limitation 해소). 9 explicit
  init site 모두 `multi_hop: false` 추가 — Default 도입으로 향후
  `..Default::default()` 점진 migrate 가능.
- `RagPipeline::ask` 의 entry 에 dispatcher 한 줄
  (`if opts.multi_hop { return self.ask_multi_hop(...) }`).
- `RagPipeline::ask_multi_hop` 신규 method. 1) decompose LLM call
  → JSON array of strings parse, 2) 각 sub-query 로 retrieve +
  chunk_id dedup pool, 3) score gate / no-chunks 가드, 4)
  pack_context (single-pass 와 helper 공유), 5) synthesize LLM
  call w/ MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT, 6) citation extract
  + Answer build. `prompt_template_version` = "rag-multi-hop-v1"
  로 stamp — eval `compare` 가 single-pass vs multi-hop 분리.
- Prompt const 신규: MULTI_HOP_DECOMPOSE_SYSTEM_PROMPT +
  MULTI_HOP_DECOMPOSE_USER_TEMPLATE + MULTI_HOP_SYNTHESIZE_SYSTEM_PROMPT
  + PROMPT_TEMPLATE_VERSION_MULTI_HOP + MULTI_HOP_MAX_SUB_QUERIES_DEFAULT.
- `kebab_core::RefusalReason::MultiHopDecomposeFailed` variant 신규.
  Cascade: kebab-store-sqlite `refusal_reason_label` + kebab-tui `ask
  refusal render` exhaustive match 갱신.
- `parse_decompose_response` + `strip_markdown_json_fence` helper —
  markdown code fence (```json / ```) strip + JSON array of strings
  parse + trim + drop empty + cap at MULTI_HOP_MAX_SUB_QUERIES_DEFAULT.
  None 반환 시 caller 가 `MultiHopDecomposeFailed` refusal.

Tests (55 passing total, 8 신규):
- 6 unit (parse_decompose_response 의 bare array / fence variants /
  garbage / cap / trim 회귀 핀).
- 2 integration: `ask_multi_hop_dispatches_and_decompose_garbage_refuses`
  (decompose garbage → MultiHopDecomposeFailed + 정확히 1 LLM call) +
  `ask_with_multi_hop_false_keeps_single_pass_path` (회귀 핀, 기존
  caller 자동 backwards-compat).

Happy-path multi-hop (decompose 성공 → synthesize) 의 integration
test 는 ScriptedLm helper 가 PR-3 의 decide loop 와 함께 도입될
때 같이 추가. 현 `MockLanguageModel` 는 canned single response 라
2-LLM-call sequence 핀 불가.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 06:45:32 +00:00
93ddece111 feat(v0.17.0/PR-B/B1): C typedef extractor + parser_version bump + orphan purge cascade
closure of HOTFIXES 2026-05-21. C typedef-wrapped anonymous
struct/enum/union 이 typedef alias 이름으로 symbol unit 방출.

- crates/kebab-parse-code/src/c.rs: type_definition 분기 추가.
  inner anonymous struct_specifier / enum_specifier / union_specifier
  탐지 → declarator field 의 type_identifier 재귀 추출 → synthetic
  unit (typedef alias). named inner aggregate / plain alias 는
  기존대로 glue. PARSER_VERSION code-c-v1 → code-c-v2.
  recover_typedef_alias + extract_typedef_alias_name helper 추가.

- crates/kebab-store-sqlite/src/store.rs: 두 helper 신규
  (parser_version bump cascade 용 doc-id 기반 orphan purge).
  - stale_chunk_ids_for_workspace_path_except_doc_id(workspace_path,
    keep_doc_id) — sister of stale_chunk_ids_at, doc_id 기반.
  - purge_document_at_workspace_path_except_doc_id(workspace_path,
    keep_doc_id) — CASCADE document/chunks 제거, assets 보존.
  keep_doc_id="" 가 "모든 doc 제거" 사용.

- crates/kebab-app/src/lib.rs: try_skip_unchanged 의 parser_mismatch
  분기에서 purge_workspace_path_for_parser_bump 호출. helper 가
  app.vector() 로 lazy 접근 + delete_by_chunk_ids + SQLite document
  row 제거. Ok(None) 반환 전 cleanup 끝나서 caller 의 새 INSERT 시
  idx_docs_workspace_path UNIQUE 충돌 회피.

- tests:
  - c.rs unit tests 4 신규 — typedef_struct_emits_unit /
    typedef_enum_emits_unit / typedef_union_emits_unit /
    typedef_to_existing_type_stays_glue (negative).
  - tier1_c_ingest_searchable: parser_version assertion code-c-v1 →
    code-c-v2.
- 회귀: bytes-edit 경로 (asset_id 변경) 의 기존 purge_orphan_at_workspace_path
  + purge_vector_orphans_for_workspace_path 는 그대로 — 신규 분기와
  공존, 기존 test 모두 PASS.

미해결 (Risks): nested typedef (typedef struct { struct {...} inner; }
Outer;) 의 inner 익명 struct 는 여전히 glue — v2 의 1차 범위는
top-level typedef alias 만.

cargo test --workspace --no-fail-fast -j 1 + clippy 통과.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 20:30:57 +00:00
0ee18149e7 test(v0.17.0/A5 follow-up): trigram tokenizer downstream test fixes
trigram tokenizer 가 snippet 단위 + 단어 경계 + BM25 raw score 분포를
모두 바꿔서 unicode61 assumption 기반의 3 test 가 regression.

- wire_search_response::search_json_truncates_with_max_tokens +
  search_plain_emits_truncated_hint_to_stderr: 단일 doc + 작은
  max_tokens 로는 snippet 이 짧아서 budget loop 가 trip 안 함.
  다중 doc fixture (5 doc) + budget 30 token 으로 hit-pop 경로
  통해 truncated=true 보장.
- fetch_integration::fetch_chunk_with_context_returns_neighbors:
  fixture body 의 2-char tokens (A1/A3 등) 가 trigram 비호환으로
  0-hit. apples/banana/cherry/durian/elder 5-char unique words
  로 갱신, query 도 cherry 로 deterministic pin.
- eval/runner::runner_per_query_snapshot_matches_fixture: trigram
  token stream 으로 BM25 raw score 변동. UPDATE_SNAPSHOTS=1 로
  regenerate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 12:21:34 +00:00
6ac7fea7b9 feat(v0.17.0/A5): trigram-aware build_match_string + SearchResponse.hint
PR-A 본체. plan Task A4 Step 1c + A5.

- lexical.rs::build_match_string 재설계: whole-phrase + token-AND
  OR-combined, 3자 미만 토큰 drop, 후보 없음 시 None (빈 MATCH
  회피). raw single-quote mode 유지.
- SearchResponse.hint additive — empty result + trimmed < 3 chars
  + non-raw 케이스에 short_query_hint helper 가 set.
- CLI 'kebab search' 가 [hint] stderr 한 줄 (text mode).
- TUI SearchState.short_query_hint + poll_worker stale-aware set
  + fire_search/mark_input_changed reset + dynamic_status 표시.
- docs/wire-schema/v1/search_response.schema.json hint additive.
- 신규 unit tests (lexical 9 PASS, 기존 2 expectation 갱신) +
  통합 회귀 (search_korean: multi_token + mixed, 3 PASS) +
  BM25 snapshot regen (trigram token stream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 11:54:25 +00:00
1969c8e3b5 fix(dogfood): k8s multi-resource YAML chunk_id collision
P10 dogfooding found that a k8s manifest with 2+ documents (e.g.
Deployment + Service in one file) fails to ingest:
  UNIQUE constraint failed: chunks.chunk_id

Root cause: tier2_shared::push_chunks_with_oversize's non-oversize branch
hardcoded split_key = None. K8sManifestResourceV1Chunker calls it once per
resource; with split_key None every resource from the same document gets
the same id_hash (= base_policy_hash) → identical chunk_id. p10-3's
code_text_paragraph_v1 had the same bug (fixed in df3c5b8) but it calls
build_chunk_no_symbol directly — the push_chunks_with_oversize path was
never fixed.

Fix: push_chunks_with_oversize gains a base_split_key parameter for the
non-oversize single-chunk case. k8s chunker passes Some(resource.line_start)
so each resource gets a distinct chunk_id; dockerfile / manifest pass None
(1 chunk per file — no sibling collision, chunk_id stays stable).

Regression coverage: k8s_multi_doc_emits_one_chunk_per_resource now asserts
chunk_id distinctness; new integration test
tier2_k8s_multi_resource_yaml_ingests_without_collision ingests a real
2-document YAML end-to-end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 23:49:37 +00:00
192835e5bf test(p10-1d): integration smoke tests for C + C++
Verifies end-to-end ingest + search + Citation::Code shape:
- tier1_c_ingest_searchable: .c file → --code-lang c search → symbol
  = function name (no nesting), lang = "c", chunker_version = "code-c-ast-v1".
- tier1_cpp_ingest_searchable: .cpp file → --code-lang cpp search →
  symbol starts with namespace::Class prefix, lang = "cpp",
  chunker_version = "code-cpp-ast-v1".

Brings code_ingest_smoke to 18 tests (Tier 1: 9 → 11, Tier 2: 3,
Tier 3: 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:31:35 +00:00
1034de25a2 fix(p10-3+p10-1d): land the missing try_skip_unchanged fallback-aware fix
PR #155 (p10-3) merged WITHOUT the reviewer's required Option B1 fix —
the implementer reported a commit SHA (2a39513) that never made it to main.
Result: every reingest of a Tier 3-fallback file (non-k8s YAML, invalid
YAML, AST extractor failure) re-runs full extract + chunk + embed because
the parser/chunker version comparison can never match (stored is
code-text-paragraph-v1 / none-v1, but caller uses Tier 1/2 dispatch
values).

This commit:
1. Adds the 7th param `fallback_chunker_version: Option<&ChunkerVersion>`
   to try_skip_unchanged + the stored_is_tier3_fallback detection branch
   (skip parser/chunker equality, keep embedder check).
2. Threads `None` through non-code call sites (md / image / pdf).
3. Code call site computes tier3_fallback_cv covering all Tier 1/2 langs
   that can fall back: rust / python / ts / js / go / java / kotlin /
   yaml / dockerfile / toml / json / xml / groovy / go-mod / c / cpp
   (p10-1D additions).
4. Adds tier3_yaml_fallback_reingest_is_unchanged + tier3_shell_reingest_is_unchanged
   regression tests (the originally-promised PR #155 regression coverage
   that also never made it to main).

Smoke tests: 14 + 2 = 16 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 14:19:17 +00:00
df3c5b8caf test(p10-3): integration smoke tests for Tier 3 (shell + yaml fallback)
Two new tests verify end-to-end Tier 3 wiring:
- tier3_shell_ingest_searchable: .sh file → --code-lang shell search →
  Citation::Code { symbol: None, lang: "shell" }, chunker_version
  "code-text-paragraph-v1".
- tier3_yaml_fallback_picks_up_non_k8s_yaml: docker-compose-shaped yaml
  (no apiVersion/kind) triggers k8s chunker's Ok(vec![]) result, fallback
  retries with Tier 3 → Citation::Code { symbol: None, lang: "yaml" } and
  chunker_version "code-text-paragraph-v1".

Also fixes a bug in CodeTextParagraphV1Chunker (Task B): short paragraphs
(≤80 lines) were emitted with split_key=None, causing all paragraphs from the
same document to share the same chunk_id (UNIQUE constraint violation at
put_chunks). Fix: always use para.line_start as split_key so every paragraph
gets a distinct id regardless of size.

Brings code_ingest_smoke to 14 tests (Tier 1: 9, Tier 2: 3, Tier 3: 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 11:37:44 +00:00
166e1ddfaf test(p10-2): integration smoke tests for Tier 2 (k8s yaml + Dockerfile + Cargo.toml)
Three new tests in code_ingest_smoke.rs verifying isolated-TempDir ingest +
--code-lang filter + Citation::Code.lang / .symbol shape for each Tier 2
chunker. Brings the suite to 12 tests (Rust 3 + Python 1 + TS 1 + JS 1 +
Go 1 + Java 1 + Kotlin 1 + yaml 1 + dockerfile 1 + manifest 1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 13:23:01 +00:00
ff1bedbef5 feat(p10-1c-jk): activate Kotlin in ingest_one_code_asset dispatch
Replaces Kotlin bail! arms with KotlinAstExtractor + CodeKotlinAstV1Chunker.
Adds kotlin_file_ingests_and_searches_as_code_citation integration test —
asserts citation.lang=kotlin, symbol=com.foo.Foo.bar, code_lang=kotlin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:54:55 +00:00
ebc4ef2eea feat(p10-1c-jk): activate Java in ingest_one_code_asset dispatch
Replaces Java bail! arms with JavaAstExtractor + CodeJavaAstV1Chunker. Adds
java_file_ingests_and_searches_as_code_citation integration test — asserts
citation.lang=java, symbol=com.foo.Foo.bar, code_lang=java.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:44:05 +00:00
c19aa006d0 feat(p10-1c-go): activate Go in ingest_one_code_asset dispatch
Replaces Go bail! arms with GoAstExtractor + CodeGoAstV1Chunker. Adds
go_file_ingests_and_searches_as_code_citation integration test — asserts
citation.lang=go, symbol=chunk.ParseDoc, code_lang=go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:13:47 +00:00
453ec15df4 fix(dogfood): document-centric fetch_span + remove get_asset_by_workspace_path
assets.workspace_path is INTENTIONALLY 'last-registered path' for twin
files (identical content at different paths share one asset row PK'd by
blake3 content hash). PR #146 made try_skip_unchanged document-centric;
PR #149 made reset --orphans-only document-centric; this PR removes the
last caller of get_asset_by_workspace_path (fetch.rs:193 in fetch_span,
which used it to reject PDF/audio media — for twins this could read the
wrong asset's media_type and pick the wrong branch).

Replaced with the natural 2-step lookup: get_document_by_workspace_path
(PR #146) → doc.source_asset_id → get_asset (NEW trait method, asset_id
is PRIMARY KEY so flip-flop-immune by construction).

Then removed get_asset_by_workspace_path trait method + SqliteStore impl
— 0 callers after the refactor.

UPSERT doc-comment refreshed in store.rs to make the 'last-registered'
semantics explicit so future readers don't try to 'fix' the flip-flop.

Dogfood follow-up (PR #142 1B + multi-root corpus).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:03:38 +00:00
5f2bd9e97e feat(dogfood): kebab reset --orphans-only — purge stored docs outside walker scope
PR #148 auto-purges only filesystem-missing files (conservative — leaves
on-disk-but-out-of-scope docs alone for data safety). This is the explicit
complement: when the user has narrowed include / widened exclude / removed
a sub-directory from the workspace and WANTS the stored docs reconciled,
they invoke 'kebab reset --orphans-only'.

Confirm prompt with orphan count + sample paths; --yes required in
non-TTY. SQLite purge via existing purge_deleted_workspace_path (PR #148)
+ vector store delete_by_chunk_ids when configured. No fs existence
check — orphans-only is the explicit 'I know what I'm doing' variant.

dogfood follow-up to PR #148 (file deletion auto-purge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:38:10 +00:00
27baec82ea fix(dogfood): auto-purge stored docs for filesystem-deleted files
Files deleted from disk (rm a.md) were leaving stale documents + chunks +
embeddings in the store, surfacing as ghost citations in search/ask.
Existing purge_orphan_at_workspace_path only handled content-changed
stale (WHERE workspace_path=? AND asset_id != ?) — file deletion has no
new asset_id.

Fix: post-walker-scan sweep. Compute (stored_paths - scanned_paths),
for each candidate check filesystem existence — only purge when the
file is TRULY missing. Scope-narrowing case (file on disk but outside
include glob) is explicitly NOT purged to protect users from accidental
data loss via config edits.

Adds:
- DocumentStore::all_workspace_paths trait method + SqliteStore impl
- purge_deleted_workspace_path in store-sqlite (returns chunk_ids for
  vector delete; deletes doc CASCADE + asset row + copied storage file)
- sweep_deleted_files in kebab-app::ingest path; called once per ingest
  before the per-asset loop
- IngestReport.purged_deleted_files counter (additive, serde default)
- CLI ingest summary mentions purge count when > 0
- 2 integration tests: file_deletion_auto_purge + include_scope_narrowing_does_NOT_purge

dogfood discovery (PR #142 1B + multi-root: kebab-docs + httpx + zod
+ lodash). Per user decision: only filesystem deletion auto-purges;
scope narrowing requires explicit kebab reset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 06:51:07 +00:00
641b92af7d fix(dogfood): document-centric try_skip_unchanged for twin-file idempotency
Identical-content files at different workspace paths share one assets row
(assets.asset_id = blake3 content hash, PRIMARY KEY). The UPSERT
`ON CONFLICT(asset_id) DO UPDATE SET workspace_path = excluded` made
twin files overwrite each other's workspace_path on every ingest, so
`get_asset_by_workspace_path(path1)` returned the OTHER twin's row (or
None) — break idempotent unchanged-detection for both files.

Fix: switch try_skip_unchanged to document-centric lookup. `documents.
workspace_path` is already UNIQUE (V001) and `id_for_doc(path, ...)`
includes path, so each twin has its own stable document row. Compare
`doc.source_asset_id` with the new asset's checksum instead of going
through the assets table.

Dogfood (multi-root: kebab-docs + httpx + zod + lodash) showed 27 of
726 docs marked Updated on every idempotent re-ingest — all 27 are
twin-file victims (empty `__init__.py` ×3, AGENTS.md ↔ CLAUDE.md
same content, duplicate logo PDFs/JPGs).

After: re-ingest reports 0 new / 0 updated / 726 unchanged.

No schema migration needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 05:27:21 +00:00
d53995a6d4 feat(p10-1b): code-js-ast-v1 chunker + activate JavaScript in app dispatch
Chunker: duplicate-with-substitution from code-ts-ast-v1 / code-rust-ast-v1.
Dispatch: replaces JS bail! arms with JavascriptAstExtractor + CodeJsAstV1Chunker.
Integration test javascript_file_ingests_and_searches_as_code_citation asserts
citation.lang=javascript, symbol=src/Bar.Bar.baz, code_lang=javascript.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:16:07 +00:00
31245a4328 fix(p10-1b): TS parser_version code-typescript-v1 → code-ts-v1 (naming consistency)
Task H implementer chose code-typescript-v1 but plan + design §3.3 use the
short form (chunker is code-ts-ast-v1 / code-js-ast-v1). Aligning parser
versions to match: rust=code-rust-v1 / python=code-python-v1 / ts=code-ts-v1
/ js=code-js-v1 (Task K). Fixes 2 sites: const PARSER_VERSION + integration
test assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:05:17 +00:00
acb61b6830 feat(p10-1b): activate TypeScript in ingest_one_code_asset dispatch
Replaces TS bail! arms with TypescriptAstExtractor + CodeTsAstV1Chunker.
Adds typescript_file_ingests_and_searches_as_code_citation integration test —
asserts citation.lang=typescript, symbol=src/Foo.Foo.bar, code_lang=typescript.
JS arms remain bail!() (Task L).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:59:41 +00:00
1815091247 feat(p10-1b): activate Python in ingest_one_code_asset dispatch
Replaces Python bail! arms with PythonAstExtractor + CodePythonAstV1Chunker.
Adds python_file_ingests_and_searches_as_code_citation integration test —
asserts citation.lang=python, symbol=kebab_eval.metrics.compute_mrr,
code_lang=python. TS/JS arms remain bail!() (Tasks J/L).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:49:01 +00:00
b5d1fe8c1e feat(p10-1a-2): backfill SearchHit.repo from doc metadata (Task 8b)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:13:01 +00:00
580576c2c6 feat(p10-1a-2): wire code ingest dispatch (ingest_one_code_asset)
Add `MediaType::Code("rust")` dispatch arm in `ingest_one_asset`,
`ingest_one_code_asset` fn (faithful mirror of `ingest_one_pdf_asset`),
and `backfill_code_lang` post-processing in `App::search_uncached`.
Integration test `code_ingest_smoke.rs` verifies full pipeline:
ingest `.rs` → Citation::Code hit with lang/symbol/line_start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 20:14:59 +00:00
th-kim0823
231d80e82d feat(stats): media/lang/bytes/stale fields on schema.v1.stats (fb-37)
Extends CountSummary with media_breakdown, lang_breakdown, stale_doc_count
fields populated via stats_ext::breakdowns(). Adds count_summary_with_threshold
for callers that need real stale counts. Mirrors all new fields onto the
wire-bound Stats struct in kebab-app::schema with #[serde(default)] for
backwards-compat. Also fixes search_budget_integration.rs for the trace field
added to SearchOpts in Task 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 12:34:57 +09:00
th-kim0823
b86b763dfb fix(fb-35): address PR #126 round 2 review
- wire schema: relax effective_end.minimum 1 → 0 + expand
  description to cover line-clamp + out-of-range sentinel
  (panic-fix R1 emits Some(0) when line_start=1 and range is
  beyond doc end — schema must accept it)
- tests: tighten first-chunk-target boundary test to assert ≤ 2
  total neighbors (3-chunk doc, N=2). Strict "first chunk →
  context_before empty" not assertable until chunks.ordinal
  column lands (R1 #9 architectural caveat)
- store: trim contradiction in list_chunk_ids_for_doc warning
  comment — drop "good enough for sequentially chunked
  markdown" phrase that conflicts with "hash sort dominates"
  paragraph above

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:55:29 +09:00
th-kim0823
7dddc1d706 fix(fb-35): address PR #126 round 1 review
- fetch_span: panic-fix on line_start > total / empty doc
  (return empty text + effective_end = line_start - 1 instead of
  out-of-bounds slice)
- truncated: reserved for budget-driven truncation only; line
  range clamp signaled via effective_end < line_end
- spec / SKILL.md / README: align rejection wording to "PDF /
  audio" (matches code; Image OCR allowed for span)
- store: warning comment on list_chunk_ids_for_doc — chunk_id
  hash sort does NOT preserve document position; real fix is a
  chunks.ordinal column, tracked as follow-up
- surrounding_chunks: saturating_add to defend against u32::MAX
  context arg on 32-bit targets
- tests: line_start > total returns empty + chunk context at
  doc boundary clamps lower bound

Deferred nits (follow-up): table-separator strict CommonMark form;
MCP per-mode strict validation; CLI chunk_id truncation in plain
output. None block correctness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:45:29 +09:00
th-kim0823
1b9d89eb3a feat(app): App::fetch span mode + PDF/audio rejection (fb-35)
Line-based slice over fmt_canonical_to_markdown output.
PDF / audio source_type → span_not_supported StructuredError.
Out-of-range line_end clamps to total; effective_end reflects
post-budget trim. invalid_input on zero / inverted bounds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:54:22 +09:00
th-kim0823
7d1f855f7e feat(app): App::fetch doc mode with budget (fb-35)
Walks CanonicalDocument blocks, serializes to markdown, applies
chars/4 budget when opts.max_tokens is set. doc_not_found
preserved through StructuredError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:48:40 +09:00
th-kim0823
610d29f053 feat(app): App::fetch chunk mode + markdown serializer (fb-35)
Chunk mode + +-N context. doc / span modes return placeholder
errors (filled by subsequent tasks). fmt_canonical_to_markdown
helper introduced now since doc mode (Task 4) consumes it.
Errors are typed StructuredError so classify preserves
chunk_not_found / doc_not_found through the wire layer.

Adds SqliteStore::list_chunk_ids_for_doc so the facade can derive
+-N neighbors without leaking direct rusqlite usage into kebab-app.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:44:51 +09:00
th-kim0823
e084b306e5 fix(fb-34): align next_cursor semantics with docs (PR #125 round 2)
Previous round-1 fix dropped the speculative cursor branch on
the truncated path, leaving a contradiction with the docs:
- snippet-only shrunk → cursor emitted (returned == k_effective)
- k-popped → cursor null (returned < k_effective)
But docs promised the opposite.

R2 resolution: emit cursor whenever more hits may be reachable
(either retriever filled the page OR budget popped hits — the
popped ones remain fetchable from offset+returned). Drop the
artificial "widen vs paginate" copy; truncated and next_cursor
are now independent signals — caller may do either or both.

Updates: app.rs::search_with_opts logic + SearchResponse doc +
schema description + SKILL.md two bullets + max_tokens=0 test
asserts cursor IS emitted on k-pop case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 21:07:04 +09:00
th-kim0823
f485608108 fix(fb-34): address PR #125 round 1 review
- error_wire: StructuredError wrapper preserves ErrorV1 through
  anyhow → classify pipeline. Adds downcast short-circuit so
  cursor::decode's typed code = "stale_cursor" reaches the wire
  instead of being string-formatted to code = "generic".
- app: search_with_opts now wraps cursor::decode error in
  StructuredError instead of anyhow! string format.
- test: error_wire pins both negative (bare anyhow → not
  stale_cursor) AND positive (StructuredError → stale_cursor)
  invariants. CLI integration test runs end-to-end and asserts
  error.v1.code on stderr.
- app: next_cursor only emitted on full-page (k-pop) path; drop
  speculative emit on snippet-only truncation that would point at
  a different page than the agent expected.
- cursor: differentiate malformed-base64 / malformed-payload /
  revision-mismatch error messages; all keep code = stale_cursor.
- test: cursor_rejected fixture uses .expect() to fail loud on
  cursor non-emission instead of silent skip.
- test: max_tokens=0 → 1-hit floor + truncated=true.
- docs: SKILL.md + schema description distinguish snippet-shrink
  (widen) vs k-pop (paginate) truncated cases. HOTFIXES notes
  --no-cache semantic shift (cached path + clear vs uncached path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 20:49:27 +09:00
th-kim0823
af80cedd81 feat(app): App::search_with_opts + SearchResponse (fb-34)
Budget loop: snippet shorten → k pop → ≥1 hit floor. Cursor
encode/decode threads corpus_revision; mismatch surfaces as
stale_cursor anyhow error. App::search retained as thin wrapper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:59:48 +09:00
th-kim0823
ebbc3a46ae feat(app): cursor encode/decode for paginated search (fb-34)
Opaque base64(JSON{offset, corpus_revision}). Mismatch or
malformed input returns ErrorV1 with code = stale_cursor.
base64 promoted to workspace dep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:49:23 +09:00
th-kim0823
dfef65f196 feat(app): staleness module + post-process search hits (fb-32)
compute_stale: strict > boundary, threshold=0 disables, future
timestamps treated as fresh (clock skew safety). App::search
re-stamps on cache hit so config threshold changes take effect
without flushing the cache.

Also unblocks the workspace build by plugging placeholder
indexed_at/stale into the two AnswerCitation construction
sites in kebab-rag/pipeline.rs (the score-gate refusal path
forwards from SearchHit; the LLM-citation path uses
UNIX_EPOCH/false until Task 7 wires the real values through
pack_context).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:30:10 +09:00
th-kim0823
7f5739d8fb 🏗️ refactor(fb-31): apply round 1 review nits
- ingest_file_with_config: lowercase normalize ext (caller-side) +
  early error on unsupported extension (`.docx` etc. now Err with
  helpful message instead of silent skipped_by_extension counter).
  New test ingest_file_errors_on_unsupported_extension.
- ingest_stdin_with_config: doc comment explaining intentional
  double-call of ensure helpers (idempotent + ~ms negligible).
- external::inject_frontmatter: simplify precheck via single
  trim_start binding + add CR-only line ending edge case.
- external::inject_frontmatter: doc note on yaml_quote escape
  contract (agent-supplied titles with special chars are safe).

Round 1 review summary: #111 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:46:55 +09:00
th-kim0823
a42f907640 🧪 test(kebab-app): ingest_stdin_with_config integration (fb-31)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:04:52 +09:00