kebab/tasks/p1/p1-1-source-fs.md
altair823 d1b99b2994 docs: mark P0–P4 done, add SMOKE recipe, refresh README
State drift after P0–P4 completion + 3 post-merge hotfixes (PR #20
--config across subcommands, PR #24 --config in kb ask, PR #25 RRF
fusion_score normalization). README still framed the project as
"spec frozen, code 0 lines"; phase docs and task specs all carried
status: planned. Sweep:

- README.md: top banner now "P0–P4 done (17/31 tasks) + 3 hotfixes
  applied"; command table marks each subcommand's owning phase and
  current status (kb ask =  via P4-3, kb eval =  P5);
  phase roadmap table grew a Status column (P0–P4 completed, P5
  next, P6–P9 pending); component count bumped 30 → 31 to reflect
  P3-5 (app-wiring, post-spec); core decisions table notes the
  RRF [0,1] normalization invariant; build+run section drops the
  "P0 not started" caveat; new pointers to HOTFIXES.md and SMOKE.md.
- docs/SMOKE.md (new): ~80-line recipe for running the full
  pipeline against an isolated /tmp/kb-smoke/ workspace via
  --config, without polluting ~/.config/kb/ or
  ~/.local/share/kb/. Covers fixture seeding, sample config.toml
  with the post-merge defaults, doctor → ingest → list →
  search × 3 modes → inspect → ask sequence, verification
  checklist, and known-behaviour notes (fastembed model
  download, RAG response time, --config hard-fail on missing
  path).
- tasks/phase-{0..4}-*.md: status frontmatter flipped planned →
  completed.
- tasks/p0/, tasks/p1/, tasks/p2/, tasks/p3/, tasks/p4/: same
  status flip across all 17 component task specs (1+6+2+5+3).
  P5–P9 stay planned.

cargo test --workspace: 319 passed; clippy clean (no source
changes in this commit, just docs + frontmatter).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:32:28 +00:00


phase: P1
component: kb-source-fs
task_id: p1-1
title: Local filesystem source connector
status: completed
depends_on: p0-1
unblocks: p1-2, p1-3, p1-4, p1-5, p1-6
contract_source: ../../docs/superpowers/specs/2026-04-27-kb-final-form-design.md
contract_sections: §3.3, §6.2, §6.6, §7.1, §7.2 SourceConnector, §8

p1-1 — Local filesystem source connector

Goal

Walk the workspace root, apply gitignore-style filters, compute BLAKE3 checksums, and produce Vec<RawAsset>.

Why now / why this size

SourceConnector is the entry point of every ingest. Stable RawAsset output unblocks every downstream P1 task (parser, normalize, chunk, store). Small enough to deliver in one PR with full test coverage.

Allowed dependencies

  • kb-core
  • kb-config
  • ignore (gitignore semantics)
  • blake3
  • walkdir
  • time
  • serde
  • thiserror
  • tracing

Forbidden dependencies

  • kb-parse-*, kb-normalize, kb-chunk, kb-store-*, kb-embed*, kb-search, kb-llm*, kb-rag, kb-tui, kb-desktop

Inputs

| input | type | source |
|---|---|---|
| SourceScope | kb_core::SourceScope | kb-app from config |
| filesystem | &Path | OS |
| .kbignore | text file | workspace root, optional |

Outputs

| output | type | downstream consumer |
|---|---|---|
| Vec<RawAsset> | kb_core::RawAsset | kb-parse-md, asset writer in kb-store-sqlite (via kb-app) |

Public surface (signatures only — no new types)

pub struct FsSourceConnector { /* internal */ }

impl FsSourceConnector {
    pub fn new(config: &kb_config::Config) -> anyhow::Result<Self>;
}

impl kb_core::SourceConnector for FsSourceConnector {
    fn scan(&self, scope: &kb_core::SourceScope) -> anyhow::Result<Vec<kb_core::RawAsset>>;
}

Behavior contract

  • POSIX-normalize every emitted workspace_path (NFC, leading ./ stripped, single /).
  • asset_id derived per design §4.2 from blake3(raw bytes) full hex.
  • media_type selected from extension + libmagic-like sniff fallback (.md → Markdown, others fall through to MediaType::Other).
  • discovered_at = current OffsetDateTime::now_utc() at scan time.
  • Combine config.workspace.exclude and .kbignore into a single filter (union; ordering does not matter).
  • Symbolic links: follow once, detect cycles via canonicalize + visited set.
  • Files larger than storage.copy_threshold_mb MB → emit AssetStorage::Reference { path, sha } (do not copy bytes here; copying is done by the asset writer task).
  • Idempotent: same input → same Vec<RawAsset> (sort by workspace_path).
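
The path and ordering rules above can be sketched roughly as follows. This is a minimal illustration, not the crate's implementation: `normalize_workspace_path` and `sort_normalized` are hypothetical helper names, and NFC normalization is deliberately left out because it needs a Unicode crate that is not assumed here.

```rust
/// Hypothetical helper: normalize a workspace-relative path to POSIX form.
/// Collapses repeated `/` and strips leading `./` segments.
/// NFC normalization is NOT performed in this sketch.
fn normalize_workspace_path(raw: &str) -> String {
    // Splitting on '/' and dropping empty segments collapses "a//b" to "a/b".
    let mut segments: Vec<&str> = raw.split('/').filter(|s| !s.is_empty()).collect();
    // Strip leading "." segments so "./a" and "././a" both become "a".
    while segments.first() == Some(&".") {
        segments.remove(0);
    }
    segments.join("/")
}

/// Hypothetical helper: normalize every path, then sort byte-wise so that
/// two runs over the same input yield the same order.
fn sort_normalized(paths: Vec<String>) -> Vec<String> {
    let mut out: Vec<String> = paths
        .iter()
        .map(|p| normalize_workspace_path(p))
        .collect();
    out.sort();
    out
}
```

With this sketch, `./a/b.md`, `a//b.md`, and `a/b.md` all normalize to `a/b.md`, which is the equivalence the test plan asks for.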

Storage / wire effects

  • Reads: filesystem under config.workspace.root.
  • Writes: nothing. (Asset copy is handled by the asset writer in kb-store-sqlite.)
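
Because the connector writes nothing, the size threshold from the behavior contract reduces to a pure decision. The sketch below shows one way that decision could look. The `AssetStorage` enum here is a local stand-in, not the real `kb_core` type: only `Reference { path, sha }` is named in this spec, the `Inline` variant is an assumption, and so is the choice of 1 MB = 1024 × 1024 bytes.

```rust
use std::path::PathBuf;

/// Local stand-in for `kb_core::AssetStorage` (assumption).
/// Only `Reference { path, sha }` appears in the spec; `Inline` is hypothetical.
#[derive(Debug, PartialEq)]
enum AssetStorage {
    Inline,
    Reference { path: PathBuf, sha: String },
}

/// Hypothetical helper: pick a storage mode from the file size alone.
/// No bytes are copied here; the caller only records the decision.
fn choose_storage(
    path: PathBuf,
    sha: String,
    len_bytes: u64,
    copy_threshold_mb: u64,
) -> AssetStorage {
    // Assumption: "MB" means 1024 * 1024 bytes.
    let threshold_bytes = copy_threshold_mb.saturating_mul(1024 * 1024);
    if len_bytes > threshold_bytes {
        // Strictly larger than the threshold: reference the file in place.
        AssetStorage::Reference { path, sha }
    } else {
        AssetStorage::Inline
    }
}
```

A file of exactly the threshold size stays on the non-reference side, matching the "larger than" wording in the contract.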

Test plan

| kind | description | fixture / data |
|---|---|---|
| unit | POSIX path normalization | inline cases incl. ./a/b.md, a//b.md, a/b.md → identical |
| unit | blake3 of known bytes matches expected hex | inline |
| unit | gitignore filter (*.tmp, node_modules/**) excludes correctly | tmp tree built in test |
| unit | union of .kbignore and config exclude works | tmp tree |
| unit | symlink cycle does not loop | tmp tree with a -> b -> a |
| snapshot | Vec<RawAsset> serialized JSON for fixture tree is stable | fixtures/source-fs/tree-1 |
| determinism | running scan twice produces byte-identical JSON | fixtures/source-fs/tree-1 |

All tests run under cargo test -p kb-source-fs with no network and no model.
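
For the symlink-cycle case, the "canonicalize + visited set" approach from the behavior contract might look like the following. It uses only the standard library rather than `ignore` or `walkdir`, and `mark_visited` and `collect_files` are hypothetical names chosen for illustration.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Returns Ok(true) the first time a canonical location is seen and
/// Ok(false) if it was already visited (for example via a symlink).
fn mark_visited(visited: &mut HashSet<PathBuf>, path: &Path) -> io::Result<bool> {
    let canonical = fs::canonicalize(path)?;
    Ok(visited.insert(canonical))
}

/// Recursively collect regular files, entering each canonical directory at most once.
fn collect_files(
    dir: &Path,
    visited: &mut HashSet<PathBuf>,
    out: &mut Vec<PathBuf>,
) -> io::Result<()> {
    if !mark_visited(visited, dir)? {
        return Ok(()); // already walked this directory: stop instead of looping
    }
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // fs::metadata follows symlinks, so a link to a directory counts as a directory.
        match fs::metadata(&path) {
            Ok(meta) if meta.is_dir() => collect_files(&path, visited, out)?,
            Ok(meta) if meta.is_file() => out.push(path),
            _ => {} // broken or self-referential link, or a special file: skip
        }
    }
    Ok(())
}
```

A link that points back at an ancestor directory resolves to a canonical path that is already in the set, so the walk terminates there.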

Definition of Done

  • cargo check -p kb-source-fs passes
  • cargo test -p kb-source-fs passes
  • Snapshot test fixtures/source-fs/tree-1 round-trips deterministically
  • No imports outside Allowed dependencies (verified via cargo tree -p kb-source-fs)
  • PR description links to design §3.3, §6.2, §7.2

Out of scope

  • File watching (P+).
  • Asset copy/reference storage on disk (kb-store-sqlite task p1-6).
  • Non-fs source connectors (HTTP, S3 — P+).

Risks / notes

  • BLAKE3 on large files (>1 GB) is fast, but the hash must be computed in a streaming fashion; do not load the whole file into memory.
  • macOS resource forks / .DS_Store should be excluded by default.
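
The streaming note can be illustrated with a chunked read loop. To keep the sketch runnable without external crates, it hashes through a small `IncrementalHasher` trait (hypothetical) with an FNV-1a stand-in; in the real crate a `blake3::Hasher`, which also exposes `update` on byte slices followed by `finalize`, would presumably take its place.

```rust
use std::io::{self, Read};

/// Hypothetical minimal interface for an incremental hasher.
trait IncrementalHasher {
    fn update(&mut self, chunk: &[u8]);
}

/// Feed `reader` to `hasher` in fixed-size chunks and return the byte count.
/// Peak memory is one 64 KiB buffer regardless of file size.
fn hash_streaming<R: Read, H: IncrementalHasher>(
    mut reader: R,
    hasher: &mut H,
) -> io::Result<u64> {
    let mut buf = vec![0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        hasher.update(&buf[..n]);
        total += n as u64;
    }
    Ok(total)
}

/// Stand-in hasher (64-bit FNV-1a), used only so the sketch has no dependencies.
/// It is NOT a substitute for BLAKE3.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl IncrementalHasher for Fnv1a {
    fn update(&mut self, chunk: &[u8]) {
        for &b in chunk {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV 64-bit prime
        }
    }
}
```

Passing a `std::fs::File` as the reader gives the same digest as hashing the full contents in one call, since the hasher only ever sees the bytes in order.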