kebab/tasks/p3/p3-5-app-wiring.md
altair823 f9714aa5cb docs(rename): kb → kebab — README, tasks/, docs/, design doc, report
Final commit. Bulk-updates every occurrence of the word `kb` in all .md files.

- 19 crate names (`kb-core`, `kb-app`, …) → `kebab-*` (including the Rust
  module path notation `kb_*` → `kebab_*`).
- Future components (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`,
  `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`,
  `kb-smoke`, `kb-architecture`) → `kebab-*` (the same prefix will be
  used when P6+ starts).
- CLI command examples: `kb ingest` / `kb search` / `kb ask` / `kb init` /
  `kb doctor` / `kb inspect` / `kb list` / `kb eval` →
  `kebab <verb>`. Both fenced code blocks and inline backticks.
- XDG paths + env vars + binary path (`target/release/kb` →
  `target/release/kebab`) synchronized.
- All references unified across the design doc / initial report / SMOKE /
  HOTFIXES / phase epics / task specs.
- `git -c user.name=kb` in task-decomposition.md is kept as-is because it
  is author information recording past git history (the author in the
  actual git history cannot be changed).
- `status: planned` → `completed` in `tasks/phase-5-evaluation.md` as
  well (not yet reflected after the P5-1 + P5-2 PRs were merged).

## Verification

- `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]"
   --include="*.md"` 0 hits (excluding the git author in
  task-decomposition.md).
- All file path references are live (renamed files are all updated to
  their new paths).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 04:01:55 +00:00

---
phase: P3
component: kebab-app (facade wiring)
task_id: p3-5
title: "Wire kebab-app facade — ingest / search / list / inspect end-to-end"
status: completed
depends_on: [p1-6, p2-2, p3-2, p3-3, p3-4]
unblocks: [p4-3, p9-1, p9-2, p9-4]
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
contract_sections: [§7 components, §7.2 traits, §1 UX scenes (ingest, search, list, inspect), §6.4 config, §10 errors]
---
# p3-5 — Wire kebab-app facade
## Goal
Replace the `bail!("not yet wired")` stubs in `kebab-app` (currently `ingest`, `search`, `list_docs`, `inspect_doc`, `inspect_chunk`) with real bodies that compose the libraries shipped in P1–P3. After this task, `kebab index` actually walks a workspace and persists chunks, and `kebab search --mode {lexical,vector,hybrid}` returns real `SearchHit`s. `kebab-app::ask` stays stubbed (P4-3 owns it).
## Why now / why this size
P1–P3 shipped libraries but the CLI is unusable: every facade method bails. Inserting this glue task between P3-4 (last library that completes the retrieval stack) and P4-1 (LLM trait, where `kebab-app::ask` will need this same wiring anyway) lets the user run the tool end-to-end for the first time and validates that the library boundaries actually compose. Limiting it to non-LLM commands keeps it a single-session task; `ask` wiring lives in P4-3.
## Allowed dependencies
`kebab-app` may depend on:
- `kebab-core`, `kebab-config` (already)
- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize`, `kebab-chunk`, `kebab-store-sqlite` (P1)
- `kebab-search` (P2-2, P3-4)
- `kebab-store-vector`, `kebab-embed`, `kebab-embed-local` (P3-2, P3-3, P3-4)
- `tracing`, `anyhow`, `serde`, `serde_json`, `time`, `dirs`, `toml` (existing)
## Forbidden dependencies
- `kebab-llm*`, `kebab-rag` (P4 ownership — `ask` stays stubbed)
- `kebab-tui`, `kebab-desktop` (P9 — those *consume* `kebab-app`, not the other way)
- `kebab-parse-pdf`, `kebab-parse-image`, `kebab-parse-audio` (P6/P7/P8)
- network HTTP libs (model download is `kebab-embed-local`'s responsibility via fastembed)
## Inputs
| input | type | source |
|-------|------|--------|
| `kebab-config::Config` | `kebab_config::Config` | loaded once at process start; threaded through every facade call |
| `SourceScope` | `kebab_core::SourceScope` | CLI |
| `SearchQuery` | `kebab_core::SearchQuery` | CLI |
| `DocFilter` | `kebab_core::DocFilter` | CLI |
| `DocumentId` / `ChunkId` | `kebab_core::ids::*` | CLI |
## Outputs
| output | type | downstream |
|--------|------|------------|
| `IngestReport` | `kebab_core::IngestReport` | `kebab-cli` (`ingest_report.v1`), `kebab-eval` (P5) |
| `Vec<SearchHit>` | `kebab_core::SearchHit` | `kebab-cli` (`search_hit.v1`), `kebab-rag` (P4-3) |
| `Vec<DocSummary>` | `kebab_core::DocSummary` | `kebab-cli` (`doc_summary.v1`) |
| `CanonicalDocument` / `Chunk` | `kebab_core::*` | `kebab-cli` (`chunk_inspection.v1`) |
## Public surface (signatures only — no new types)
The signatures already live on `kebab-app`. This task only swaps the bodies. Frozen surface:
```rust
pub fn ingest(scope: SourceScope, summary_only: bool) -> anyhow::Result<IngestReport>;
pub fn list_docs(filter: DocFilter) -> anyhow::Result<Vec<DocSummary>>;
pub fn inspect_doc(id: &DocumentId) -> anyhow::Result<CanonicalDocument>;
pub fn inspect_chunk(id: &ChunkId) -> anyhow::Result<Chunk>;
pub fn search(query: SearchQuery) -> anyhow::Result<Vec<SearchHit>>;
// `ask` and `init_workspace` / `doctor` stay as-is.
```
If a constructor helper is needed (e.g., a single `App::open(&Config)` that opens `SqliteStore`, `LanceVectorStore`, embedder, retrievers once), it MAY be added pub-or-pub(crate) to `kebab-app`. Keep all newly-added types crate-private unless explicitly required by `kebab-cli`.
## Behavior contract
### `ingest(scope, summary_only)`
Pipeline per design §1.2:
1. Resolve `scope` → set of files under `config.workspace.root` filtered by `config.workspace.include`/`exclude`. Use `kebab-source-fs::FsSourceConnector` and pass each `RawAsset` + bytes through.
2. For each markdown asset: `kebab_parse_md::frontmatter::parse_frontmatter` + `kebab_parse_md::blocks::parse_blocks` → `kebab_normalize::build_canonical_document` → `kebab_chunk::MdHeadingV1Chunker::chunk`.
3. SQLite writes per design §5.8 (one ingest = one transaction): `put_asset_with_bytes` → `put_document` → `put_blocks` → `put_chunks`.
4. If `config.models.embedding.provider == "fastembed"`: build `FastembedEmbedder` once at the top of the run, embed every chunk's `text` as `EmbeddingKind::Document`, batch by `config.models.embedding.batch_size`, then `LanceVectorStore::ensure_table` + `upsert`. Skip vector indexing if provider is `"none"` or `embedding.dimensions == 0` (config opt-out path).
5. Record an `ingest_runs` row via `JobRepo` with the aggregate counts. `summary_only=true` writes `items_json=NULL`.
6. Return `IngestReport` per `wire-schema/v1/ingest_report.schema.json`.
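
The opt-out check and batching in step 4 can be sketched with the standard library alone. The `Embedder` trait, `LenEmbedder`, and both function names below are hypothetical stand-ins for illustration; the real `kebab-embed` trait and `FastembedEmbedder` signatures may differ.

```rust
/// Hypothetical stand-in for the real embedder trait.
trait Embedder {
    fn embed_documents(&self, texts: &[&str]) -> Vec<Vec<f32>>;
}

/// Toy embedder that only exists to make the sketch runnable:
/// each text becomes a one-dimensional vector holding its byte length.
struct LenEmbedder;

impl Embedder for LenEmbedder {
    fn embed_documents(&self, texts: &[&str]) -> Vec<Vec<f32>> {
        texts.iter().map(|t| vec![t.len() as f32]).collect()
    }
}

/// Config opt-out path: skip vector indexing when the provider is
/// "none" or the configured dimension count is zero.
fn embeddings_disabled(provider: &str, dimensions: usize) -> bool {
    provider == "none" || dimensions == 0
}

/// Embeds chunk texts in batches of `batch_size`, preserving input order.
fn embed_in_batches(
    embedder: &dyn Embedder,
    texts: &[&str],
    batch_size: usize,
) -> Vec<Vec<f32>> {
    // `slice::chunks(0)` panics, so clamp a misconfigured batch size to 1.
    let batch_size = batch_size.max(1);
    let mut vectors = Vec::with_capacity(texts.len());
    for batch in texts.chunks(batch_size) {
        vectors.extend(embedder.embed_documents(batch));
    }
    vectors
}
```

Building the embedder once and passing it by reference, as here, matches the "build once at the top of the run" requirement; only the batch boundaries change per call.
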
Errors: any per-file parse failure should be recorded as a `Warning` in `Provenance` and skipped, not propagated. Only structural failures (DB unreachable, FS permission denied) abort the whole run.
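
A minimal sketch of that error policy, assuming a hypothetical `IngestError` enum and `Report` struct. The real facade returns `anyhow::Result<IngestReport>` and records warnings in `Provenance`, so treat the types below as illustration only.

```rust
/// Hypothetical split between recoverable and fatal ingest errors.
#[derive(Debug)]
enum IngestError {
    /// Per-file problem (for example malformed frontmatter): record and skip.
    Parse(String),
    /// Structural problem (DB unreachable, FS permission denied): abort.
    Structural(String),
}

/// Hypothetical aggregate; stands in for the warning list and counters.
#[derive(Debug, Default)]
struct Report {
    indexed: usize,
    warnings: Vec<String>,
}

/// Runs `ingest_one` over every path. Parse failures become warnings,
/// the first structural failure aborts the whole run.
fn ingest_all<F>(paths: &[&str], mut ingest_one: F) -> Result<Report, IngestError>
where
    F: FnMut(&str) -> Result<(), IngestError>,
{
    let mut report = Report::default();
    for &path in paths {
        match ingest_one(path) {
            Ok(()) => report.indexed += 1,
            Err(IngestError::Parse(msg)) => {
                report.warnings.push(format!("{path}: {msg}"));
            }
            Err(e @ IngestError::Structural(_)) => return Err(e),
        }
    }
    Ok(report)
}
```
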
### `search(query)`
Per `query.mode`:
- `Lexical` → `kebab_search::LexicalRetriever::search`. Takes a read-only borrow of the SQLite connection.
- `Vector` → `kebab_search::VectorRetriever::search`. Requires `Embedder` (built from config) + `VectorStore` (LanceDB).
- `Hybrid` → `kebab_search::HybridRetriever::search` composing the above two.
Each call constructs (or reuses) the retrievers from a single `App` context that holds `Arc<SqliteStore>`, `Arc<LanceVectorStore>`, `Arc<dyn Embedder>` so cold-start cost is paid once per process. Document the lifetime model: a CLI invocation passes through `App::open(&Config)` once, runs the requested op, then drops everything on exit. The TUI (P9) holds the `App` for the session.
`SearchHit.embedding_model` is set to `Some(embedder.model_id())` for Vector / Hybrid modes; `None` for Lexical-only. `SearchHit.index_version` follows the retriever's reported `index_version()`.
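
The `embedding_model` rule can be written as a small pure function. `SearchMode`, `Hit`, and both helpers here are trimmed-down, hypothetical mirrors of the `kebab_core` types, kept local so the sketch compiles on its own.

```rust
/// Hypothetical mirror of the query mode enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SearchMode {
    Lexical,
    Vector,
    Hybrid,
}

/// Hypothetical, trimmed-down hit with only the fields this rule touches.
#[derive(Debug, PartialEq)]
struct Hit {
    chunk_id: String,
    embedding_model: Option<String>,
}

/// `None` for lexical-only search, `Some(model_id)` whenever an
/// embedder took part in producing the hit.
fn embedding_model_for(mode: SearchMode, model_id: &str) -> Option<String> {
    match mode {
        SearchMode::Lexical => None,
        SearchMode::Vector | SearchMode::Hybrid => Some(model_id.to_string()),
    }
}

/// Stamps every hit of one search call with the same model value.
fn stamp_hits(mut hits: Vec<Hit>, mode: SearchMode, model_id: &str) -> Vec<Hit> {
    let model = embedding_model_for(mode, model_id);
    for hit in &mut hits {
        hit.embedding_model = model.clone();
    }
    hits
}
```
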
### `list_docs(filter)` / `inspect_doc(id)` / `inspect_chunk(id)`
Direct delegation to `kebab_core::DocumentStore` trait methods on `SqliteStore`. No vector / embedding involvement.
### Lifecycle
Define an internal `pub(crate) struct App { config: Arc<Config>, sqlite: Arc<SqliteStore>, vector: Option<Arc<LanceVectorStore>>, embedder: Option<Arc<dyn Embedder + Send + Sync>>, /* retrievers built per call */ }` with a `pub(crate) fn open(config: &Config) -> Result<Self>`. Each public `kebab-app::*` function builds an `App` (or accepts one in tests) and uses it. The free functions stay the public API so `kebab-cli` and `kebab-tui` don't need to refactor.
`vector` and `embedder` are `Option` because the spec allows running kebab without embeddings (lexical-only mode for resource-constrained boxes — `provider == "none"`). When absent, `search` with `Vector` / `Hybrid` modes returns `anyhow::Error` with a hint to enable embeddings; `ingest` skips the vector step.
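
One way the optional embedder could look, as a std-only sketch. `Config`, `FakeEmbedder`, and `require_embedder` are hypothetical, the error type is `String` rather than `anyhow::Error`, and the SQLite and LanceDB handles are left out to keep the example self-contained.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the real embedder trait.
trait Embedder: Send + Sync {
    fn model_id(&self) -> &str;
}

/// Placeholder embedder; the real one would be built from config.
struct FakeEmbedder;

impl Embedder for FakeEmbedder {
    fn model_id(&self) -> &str {
        "fake-model"
    }
}

/// Hypothetical slice of the config that decides the opt-out.
struct Config {
    provider: String,
    dimensions: usize,
}

struct App {
    embedder: Option<Arc<dyn Embedder>>,
}

impl App {
    /// Builds the embedder once, or leaves it absent in lexical-only mode.
    fn open(config: &Config) -> App {
        let enabled = config.provider != "none" && config.dimensions > 0;
        let embedder = if enabled {
            let e: Arc<dyn Embedder> = Arc::new(FakeEmbedder);
            Some(e)
        } else {
            None
        };
        App { embedder }
    }

    /// Vector and hybrid search call this first and surface the hint on failure.
    fn require_embedder(&self, mode: &str) -> Result<&Arc<dyn Embedder>, String> {
        self.embedder.as_ref().ok_or_else(|| {
            format!("{mode} search needs embeddings; set an embedding provider other than \"none\"")
        })
    }
}
```

Lexical search never calls `require_embedder`, so it keeps working with `provider == "none"`.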
### Versioning
- `parser_version` from `kebab-parse-md` constant.
- `chunker_version` from `MdHeadingV1Chunker::chunker_version()`.
- `embedding_model` / `embedding_version` from the embedder.
- `index_version` from the retriever (or `Lexical`/`Vector` IndexId).
All wired through to the persisted records and the wire output.
## Storage / wire effects
- Writes: `data_dir/kebab.sqlite` (assets, documents, blocks, chunks, embedding_records, jobs/ingest_runs), `data_dir/assets/<aa>/<id>` (when copied), `data_dir/lancedb/chunk_embeddings_<model>_<dim>.lance/` (when embeddings on), `data_dir/models/fastembed/` (model cache, first run only).
- Reads on subsequent calls: same DB.
- Wire JSON conforms to `ingest_report.v1`, `search_hit.v1`, `doc_summary.v1`, `chunk_inspection.v1`; `kebab-cli` already has the wrappers.
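
The on-disk layout above can be illustrated with two path helpers. Both function names are hypothetical, and the sketch assumes that `<aa>` is the first two characters of the asset id and that `<model>` is already a filesystem-safe string; neither detail is stated in this spec.

```rust
use std::path::{Path, PathBuf};

/// `data_dir/assets/<aa>/<id>`.
/// Assumption: `<aa>` is the first two characters of the asset id.
fn asset_path(data_dir: &Path, asset_id: &str) -> PathBuf {
    let shard: String = asset_id.chars().take(2).collect();
    data_dir.join("assets").join(shard).join(asset_id)
}

/// `data_dir/lancedb/chunk_embeddings_<model>_<dim>.lance`.
/// Assumption: `model` needs no further sanitizing.
fn lance_table_path(data_dir: &Path, model: &str, dim: usize) -> PathBuf {
    data_dir
        .join("lancedb")
        .join(format!("chunk_embeddings_{model}_{dim}.lance"))
}
```
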
## Test plan
| kind | description | fixture / data |
|------|-------------|----------------|
| integration | `ingest` over a 3-file fixture workspace produces non-empty `IngestReport` and writes the expected SQLite rows + Lance rows | `crates/kebab-app/tests/fixtures/workspace/`, tmp `data_dir` |
| integration | `ingest` with `provider="none"` skips Lance and produces an `IngestReport` with `embeddings_indexed=0` | tmp config |
| integration | `ingest` is idempotent: re-running on the same workspace updates `documents.updated_at` and bumps `doc_version` without duplicating rows | tmp `data_dir` |
| integration | `search` with `mode=Lexical` returns hits whose `embedding_model` is `None` | tmp `data_dir` |
| integration | `search` with `mode=Vector` returns hits whose `embedding_model` matches the configured model | tmp `data_dir` (AVX `#[ignore]`) |
| integration | `search` with `mode=Hybrid` returns hits whose `RetrievalDetail.method == Hybrid` | tmp `data_dir` (AVX `#[ignore]`) |
| integration | `search` with `mode=Vector` and `provider="none"` returns a clear error | tmp config |
| integration | `list_docs` with `tags_any=["rust"]` filters correctly | tmp `data_dir` |
| integration | `inspect_doc` round-trips a document; `inspect_chunk` round-trips a chunk | tmp `data_dir` |
| smoke | end-to-end CLI smoke: `kebab index` + `kebab search --mode hybrid "..."` against the fixture workspace using `cargo run` (manual / `assert_cmd`) | fixture |
`#[ignore]` AVX-gated tests follow the P3-3/P3-4 pattern: `require_avx_or_panic()` at the top of each LanceDB-touching body. Default `cargo test -p kebab-app` runs the lexical-only and SQLite-only paths.
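
A sketch of what such a gate might look like, using the standard `is_x86_feature_detected!` macro. The actual `require_avx_or_panic()` from P3-3/P3-4 is not shown in this spec, so the bodies and the `has_avx` helper below are assumptions.

```rust
/// Runtime AVX check on x86 / x86_64.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn has_avx() -> bool {
    std::is_x86_feature_detected!("avx")
}

/// Other architectures never report AVX.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn has_avx() -> bool {
    false
}

/// Assumed behavior: fail with a clear message instead of crashing
/// inside a native library when the CPU lacks AVX.
fn require_avx_or_panic() {
    assert!(has_avx(), "this test needs AVX; run it on AVX hardware");
}

/// Hypothetical test showing both gates together: `#[ignore]` keeps it
/// out of the default lane, the runtime check guards `-- --ignored` runs.
#[test]
#[ignore = "requires AVX"]
fn vector_search_smoke() {
    require_avx_or_panic();
    // LanceDB-touching assertions would go here.
}
```
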
All tests under `cargo test -p kebab-app`. CLI smoke optional via `assert_cmd` if it doesn't add a heavyweight dep tree.
## Definition of Done
- [ ] `cargo check -p kebab-app` passes.
- [ ] `cargo test -p kebab-app` passes (default lane).
- [ ] `cargo test -p kebab-app -- --ignored` passes on AVX hardware.
- [ ] `cargo run -p kebab-cli -- index` succeeds against a non-empty fixture workspace.
- [ ] `cargo run -p kebab-cli -- search --mode hybrid "<term>"` returns real hits + citations.
- [ ] `cargo run -p kebab-cli -- list` returns the indexed documents.
- [ ] `cargo clippy --workspace --all-targets -- -D warnings` clean.
- [ ] No imports outside Allowed dependencies.
- [ ] PR links design §1.2, §1.5, §1.6, §7.
## Out of scope
- `kebab-app::ask` body — P4-3 (RAG pipeline).
- `kebab index --rebuild-fts` CLI option — wire `kebab_store_sqlite::rebuild_chunks_fts` only when needed by a downstream task.
- `kebab index --resume` checkpointing — design §1.2 mentions it but spec leaves to a P+ refinement; document but defer.
- `--watch` mode — P+.
- TUI / desktop integration — P9 consumes the wired facade.
## Risks / notes
- Cold-start cost: first `kebab index` run downloads the fastembed model (~470MB) and warms the ONNX session. Surface via `tracing::info!` (already wired in P3-2).
- **Post-merge fix (2026-05)**: original P3-5 implementation left every `kebab-cli` subcommand calling the bare `kebab_app::*` free functions, which silently re-loaded `Config::load(None)` (XDG default) and ignored the user's `--config <path>` flag. CLI now goes through `kebab_app::*_with_config(cfg, ...)` everywhere. The `#[doc(hidden)] pub fn *_with_config` companions originally framed as "test seam" are now the official config-explicit API for CLI / TUI / integration tests. See [HOTFIXES.md](../HOTFIXES.md) for details.
- The `App` lifecycle struct is internal but its construction is the natural seam for adding caching / connection pooling later. Keep it `pub(crate)` so future refactors don't break the CLI.
- Mismatched `index_version` across stored records and the live retriever should fail loud at `App::open` (not at first search). Reuse the `tracing::warn!` from `HybridRetriever::new` (P3-4).
- The fastembed adapter holds a `tokio::runtime` (P3-3); `App` must be constructed from a synchronous context. Document on `App::open`.
- Performance: `ingest` over a large workspace (10k files) needs to keep SQLite WAL in healthy shape. One transaction per document is the spec; verify checkpoint behavior under load (manual benchmark, not a unit test).
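
The post-merge fix noted above comes down to which entry point the CLI calls. In this sketch `Config`, `Config::load`, and both `search*` functions are simplified stand-ins, and `<xdg-default>` is a placeholder rather than the real default path.

```rust
use std::path::PathBuf;

/// Simplified stand-in for the real config type.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    source: PathBuf,
}

impl Config {
    /// `None` means "use the default location" (placeholder path here).
    fn load(path: Option<PathBuf>) -> Config {
        Config {
            source: path.unwrap_or_else(|| PathBuf::from("<xdg-default>")),
        }
    }
}

/// Config-explicit entry point: uses whatever the caller loaded,
/// so a `--config <path>` flag is honored.
fn search_with_config(cfg: &Config, query: &str) -> String {
    format!("search {query:?} using {}", cfg.source.display())
}

/// Convenience wrapper that always reloads the default config.
/// Calling this from the CLI is how the user's flag got ignored.
fn search(query: &str) -> String {
    search_with_config(&Config::load(None), query)
}
```

With this shape the CLI loads its config once from the flag and passes it down, while the bare wrapper remains available for callers that really want the default.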