Files
kebab/tasks/p1/p1-6-store-sqlite.md
altair823 d1b99b2994 docs: mark P0–P4 done, add SMOKE recipe, refresh README
State drift after P0–P4 completion + 3 post-merge hotfixes (PR #20
--config across subcommands, PR #24 --config in kb ask, PR #25 RRF
fusion_score normalization). README still framed the project as
"spec frozen, code 0 lines"; phase docs and task specs all carried
status: planned. Sweep:

- README.md: top banner now "P0–P4 done (17/31 tasks) + 3 hotfixes
  applied"; command table marks each subcommand's owning phase and
  current status (kb ask =  via P4-3, kb eval =  P5);
  phase roadmap table grew a Status column (P0–P4 completed, P5
  next, P6–P9 pending); component count bumped 30 → 31 to reflect
  P3-5 (app-wiring, post-spec); core decisions table notes the
  RRF [0,1] normalization invariant; build+실행 section drops the
  "P0 미시작" caveat; new pointers to HOTFIXES.md and SMOKE.md.
- docs/SMOKE.md (new): ~80-line recipe for running the full
  pipeline against an isolated /tmp/kb-smoke/ workspace via
  --config, without polluting ~/.config/kb/ or
  ~/.local/share/kb/. Covers fixture seeding, sample config.toml
  with the post-merge defaults, doctor → ingest → list →
  search × 3 modes → inspect → ask sequence, verification
  checklist, and known-behaviour notes (fastembed model
  download, RAG response time, --config hard-fail on missing
  path).
- tasks/phase-{0..4}-*.md: status frontmatter flipped planned →
  completed.
- tasks/p0/, tasks/p1/, tasks/p2/, tasks/p3/, tasks/p4/: same
  status flip across all 17 component task specs (1+6+2+5+3).
  P5–P9 stay planned.

cargo test --workspace: 319 passed; clippy clean (no source
changes in this commit, just docs + frontmatter).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:32:28 +00:00

6.5 KiB
Raw Blame History

phase, component, task_id, title, status, depends_on, unblocks, contract_source, contract_sections
phase component task_id title status depends_on unblocks contract_source contract_sections
P1 kb-store-sqlite (P1 subset) p1-6 SQLite store: assets/documents/blocks/chunks + asset writer + migrations completed
p1-1
p1-4
p1-5
p2-1
p3-3
p4-3
../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
§5 DDL (5.1
5.2
5.3
5.4
5.5 chunks only — FTS handled in p2-1)
§5.7 jobs/ingest_runs
§5.8 transactions
§6.3 data_dir layout

p1-6 — SQLite store (P1 subset)

Goal

Persist RawAsset, CanonicalDocument, Blocks, Chunks into SQLite per design §5; copy raw asset bytes into data_dir/assets/<aa>/<asset_id> (or reference if larger than threshold); record an ingest_runs row.

Why now / why this size

P1's terminal task. Closes the loop walk → parse → chunk → store. The FTS5 virtual table and triggers are intentionally deferred to p2-1 to keep this task focused on the relational schema and asset I/O.

Allowed dependencies

  • kb-core
  • kb-config
  • rusqlite (with bundled-sqlcipher disabled; use bundled feature)
  • refinery for migrations
  • serde_json
  • time
  • blake3 (asset copy verification)
  • tracing
  • thiserror

Forbidden dependencies

  • kb-source-fs (only types via kb-core), kb-parse-md, kb-normalize, kb-chunk (only types via kb-core), kb-store-vector, kb-embed*, kb-search, kb-llm*, kb-rag, kb-tui, kb-desktop

Inputs

input type source
migrations migrations/V001__init.sql repo
RawAsset + bytes (RawAsset, Vec<u8>) p1-1 + reader
CanonicalDocument kb_core::CanonicalDocument p1-4
Vec<Chunk> kb_core::Chunk p1-5
IngestRun aggregates (scope, counts, duration) kb-app

Outputs

output type downstream
data_dir/kb.sqlite rows in assets, documents, blocks, chunks, document_tags, ingest_runs, jobs, schema_meta, migrations every later phase
data_dir/assets/<aa>/<asset_id> bytes (when copied) future re-extraction, integrity verification
IngestReport (wire schema v1) kb_core::IngestReport kb-cli, eval

Public surface (signatures only — no new types)

pub struct SqliteStore { /* internal */ }

impl SqliteStore {
    pub fn open(config: &kb_config::Config) -> anyhow::Result<Self>;
    pub fn run_migrations(&self) -> anyhow::Result<()>;

    pub fn put_asset_with_bytes(&self, asset: &kb_core::RawAsset, bytes: &[u8]) -> anyhow::Result<()>;
}

impl kb_core::DocumentStore for SqliteStore {
    fn put_asset(&self, a: &kb_core::RawAsset) -> anyhow::Result<()>;
    fn put_document(&self, d: &kb_core::CanonicalDocument) -> anyhow::Result<()>;
    fn put_blocks(&self, doc: &kb_core::DocumentId, blocks: &[kb_core::Block]) -> anyhow::Result<()>;
    fn put_chunks(&self, doc: &kb_core::DocumentId, chunks: &[kb_core::Chunk]) -> anyhow::Result<()>;
    fn get_document(&self, id: &kb_core::DocumentId) -> anyhow::Result<Option<kb_core::CanonicalDocument>>;
    fn get_chunk(&self, id: &kb_core::ChunkId) -> anyhow::Result<Option<kb_core::Chunk>>;
    fn list_documents(&self, filter: &kb_core::DocFilter) -> anyhow::Result<Vec<kb_core::DocSummary>>;
}

impl kb_core::JobRepo for SqliteStore { /* per design §7.2 signatures */ }

Behavior contract

  • DDL: migrations/V001__init.sql ships exactly the SQL in design §5.1, §5.2, §5.3, §5.4, §5.5 (chunks table only — FTS table & triggers come in p2-1 as V002), §5.7 jobs/ingest_runs/answers/eval_runs/eval_query_results, §5.6 embedding_records.
  • Pragmas at open: foreign_keys=ON, journal_mode=WAL, synchronous=NORMAL, temp_store=MEMORY.
  • One ingest of one document = one transaction (BEGIN..COMMIT). Partial failures roll back; warnings are not failures.
  • Bulk ingest commits per-document.
  • Asset writer:
    • if asset.byte_len <= storage.copy_threshold_mb * 1_048_576: write bytes to assets_dir/<asset_id[..2]>/<asset_id> (mode 0o644), record storage_kind='copied'.
    • else: do not copy; record storage_kind='reference' with storage_path = asset.source_uri's file path.
    • In either case, recompute blake3 of the source bytes once on write/verify and store in assets.checksum. Mismatch → return StoreError::Conflict.
  • Idempotency: re-ingesting the same (workspace_path, asset_id, parser_version) updates documents.updated_at, increments doc_version, replaces blocks/chunks. No row duplication.
  • document_tags: re-derived from Metadata.tags on each put.
  • ingest_runs.items_json is null when caller passes summary_only=true.
  • All wire JSON returned (IngestReport) conforms to docs/wire-schema/v1/ingest_report.schema.json. Fail loudly if schema not present (caller must vendor it).

Storage / wire effects

  • Writes: kb.sqlite (multiple tables), data_dir/assets/<aa>/<asset_id> (copied case).
  • Reads on subsequent calls: same DB.

Test plan

kind description fixture / data
migration fresh DB after run_migrations has all P1 tables and indexes tmp dir
unit put_asset_with_bytes copy mode writes file with correct mode and bytes tmp dir
unit put_asset_with_bytes reference mode does not write file but records path tmp dir + large fake size
unit checksum mismatch returns Conflict error tmp dir + tampered bytes
unit put_document idempotency: same input twice → 1 row, doc_version bumped tmp dir
unit put_blocks + put_chunks transactional rollback on simulated failure tmp dir
contract DocumentStore trait round-trip for fixture document fixtures/markdown/code-and-table.md
snapshot IngestReport JSON for fixture run fixture

All tests under cargo test -p kb-store-sqlite with no network.

Definition of Done

  • cargo check -p kb-store-sqlite passes
  • cargo test -p kb-store-sqlite passes
  • migration V001__init.sql matches design §5 verbatim (diff-checked in CI)
  • Writes to ~/.local/share/kb/ are gated by kb-config's data_dir and never escape it
  • No imports outside Allowed dependencies
  • PR links design §5

Out of scope

  • FTS5 virtual table and triggers (p2-1).
  • Vector store (p3-3).
  • Embedding records writer (p3-2).
  • Search queries (p2-2).

Risks / notes

  • WAL mode requires careful test cleanup: tests must drop the connection before removing kb.sqlite-wal / -shm.
  • Asset directory shard prefix uses asset_id[..2]; using asset_id[..1] would create at most 16 dirs (insufficient).