- walker.rs: document why we pick walkdir over ignore::WalkBuilder
(explicit canonical-path comparison for sibling-subtree symlinks).
- walker.rs: log canonicalize failures via tracing::debug! (was a silent
`Err(_) => continue`) so broken/permission-denied symlink targets are
observable at debug verbosity.
- connector.rs: TODO marker on the scope.include debug-log noting the
filter belongs at the extractor router (P1-2/P1-3).
- connector.rs: TODO marker on expand_tilde to hoist tilde + ${VAR}
expansion into a kb-config helper once available.
- connector.rs: comment on the .kbignore read documenting the
re-read-on-every-scan() contract.
- connector.rs test: tighten the `.kbignore`-itself ADR comment and
upgrade the assertion to actively pin "`.kbignore` IS emitted" instead
of "either is fine"; future drift will now fail the test.
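
A minimal sketch of the canonicalize-failure change described for
walker.rs above. The helper name `canonicalize_or_skip` and the `log`
vector are hypothetical and exist only to keep the example std-only; the
actual code is described as logging through tracing::debug! instead.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve `path` to its canonical form, recording a
/// message instead of silently dropping the error. The `log` vector is a
/// stand-in for a debug-level log call in this sketch.
fn canonicalize_or_skip(path: &Path, log: &mut Vec<String>) -> Option<PathBuf> {
    match fs::canonicalize(path) {
        Ok(canonical) => Some(canonical),
        Err(err) => {
            // Previously the equivalent arm was `Err(_) => continue`, so a
            // broken or permission-denied target left no trace at all.
            log.push(format!(
                "canonicalize failed for {}: {}",
                path.display(),
                err
            ));
            None
        }
    }
}
```

The caller still skips the entry when `None` comes back; the only
behavioural difference is that the failure is now observable.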
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Document AssetStorage::Copied / Reference path semantics so P1-6 (asset
writer) knows that at scan time `Copied.path` is the SOURCE path and the
writer is responsible for both copying bytes AND overwriting `path`
with the destination.
- Rename the dangling-symlink test to make its scope explicit
(`dangling_symlink_pseudo_cycle_does_not_crash`); the prior name implied
a real two-step directory cycle but the targets were broken links.
- Add `two_step_directory_cycle_visited_set_breaks_loop`: builds
`a/loop -> ../b` and `b/loop -> ../a` over real directories with real
files, asserting scan terminates with a finite, deterministic asset
list — exercises the canonical-path visited-set in walker::walk_files.
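
A rough, std-only sketch of how a canonical-path visited set can break
the `a/loop -> ../b`, `b/loop -> ../a` cycle. This is not the code in
walker::walk_files (which is described as building on walkdir); the
function name `walk_files_sketch` and the breadth-first traversal are
assumptions made for illustration.

```rust
use std::collections::{HashSet, VecDeque};
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect regular files under `root`, following directory symlinks.
/// Each directory is entered at most once, keyed by its canonical path,
/// so symlink cycles terminate.
fn walk_files_sketch(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut visited: HashSet<PathBuf> = HashSet::new();
    let mut files = Vec::new();
    let mut queue = VecDeque::from([root.to_path_buf()]);

    while let Some(dir) = queue.pop_front() {
        // Dangling or unreadable targets are skipped in this sketch.
        let canonical = match fs::canonicalize(&dir) {
            Ok(path) => path,
            Err(_) => continue,
        };
        // `insert` returns false when the directory was already entered.
        if !visited.insert(canonical) {
            continue;
        }
        // Sort entries so the traversal order does not depend on the
        // order in which the filesystem happens to list them.
        let mut entries = fs::read_dir(&dir)?
            .map(|entry| entry.map(|e| e.path()))
            .collect::<io::Result<Vec<PathBuf>>>()?;
        entries.sort();
        for path in entries {
            // `fs::metadata` follows symlinks, so a link to a directory
            // is queued like a directory and checked against `visited`.
            match fs::metadata(&path) {
                Ok(meta) if meta.is_dir() => queue.push_back(path),
                Ok(meta) if meta.is_file() => files.push(path),
                _ => {} // broken link or other entry type
            }
        }
    }
    files.sort();
    Ok(files)
}
```

Visiting breadth-first means the real directories `a` and `b` are
entered before either `loop` link is dequeued, so each file is reported
under its direct path rather than through a symlink.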
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fixtures/source-fs/tree-1/:
  README.md
  notes/alpha.md
  notes/beta.md
  ignored/skip.tmp   (excluded by .kbignore *.tmp)
  .kbignore          ("*.tmp")
  .DS_Store          (implicitly excluded by FsSourceConnector)
The committed baseline (tree-1.snapshot.json) has discovered_at,
source_uri.value, and stored.path replaced with "<stripped>" so the
JSON is portable across checkout locations and CI runs. The test
applies the same stripping to scan output before comparing.
The determinism test runs scan twice and asserts byte-identical
serialized JSON (post-strip) — same filesystem state must yield the
same Vec<RawAsset>.
Regenerate the baseline with:
    KB_REGEN_SNAPSHOT=1 cargo test -p kb-source-fs --test snapshot_tree1 -- tree_1_snapshot_matches_baseline
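
A simplified sketch of the stripping step applied before comparison.
`AssetRecord` is a hypothetical flat stand-in for RawAsset (the real
type nests source_uri.value and stored.path and is serialized to JSON),
and `strip_volatile` is an illustrative name, not the test's helper.

```rust
/// Hypothetical, flattened stand-in for RawAsset used only in this sketch.
#[derive(Debug, Clone, PartialEq)]
struct AssetRecord {
    workspace_path: String,
    discovered_at: String,
    source_uri_value: String,
    stored_path: String,
}

const STRIPPED: &str = "<stripped>";

/// Replace the fields that vary by checkout location or run time with a
/// fixed placeholder, leaving the portable fields untouched.
fn strip_volatile(mut assets: Vec<AssetRecord>) -> Vec<AssetRecord> {
    for asset in &mut assets {
        asset.discovered_at = STRIPPED.to_string();
        asset.source_uri_value = STRIPPED.to_string();
        asset.stored_path = STRIPPED.to_string();
    }
    assets
}
```

With the volatile fields normalized, two scans of the same tree from
different checkout paths compare equal, which is the property both the
baseline test and the determinism test rely on.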
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cases:
- root/notes -> root (single-link cycle through workspace root).
- root/a -> b, root/b -> a (two-step cycle of dangling symlinks).
Both must complete within seconds and surface alpha.md exactly once.
Proves walker::walk_files's visited-set guard catches realistic cycle
shapes via the public SourceConnector API.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walk config.workspace.root, apply gitignore-style filters
(config.workspace.exclude ∪ .kbignore ∪ baked-in defaults for
.DS_Store / ._*), stream BLAKE3 over each file, and emit a
deterministic Vec<RawAsset> sorted by workspace_path.
Modules:
- hash: streaming blake3::Hasher + 64 KiB read buffer (no whole-file
loads); pinned digests for empty input and "hello world".
- media: extension → MediaType (markdown/pdf/image/audio/other).
- walker: ignore::OverrideBuilder for filter union; walkdir with
manual visited-set cycle protection on top of follow_links.
- connector: public FsSourceConnector::new(&Config) +
SourceConnector::scan(&SourceScope) impl. Uses
kb_core::to_posix for WorkspacePath construction (carries the
P0-1 `#` rejection through unchanged) and kb_core::id_for_asset
for AssetId derivation. Storage variant signals intent only;
actual byte copy is P1-6's responsibility.
Per design §3.3, §6.2, §6.6, §7.1, §7.2, §8.
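
A std-only sketch of the streaming read loop that the hash module's
64 KiB buffer implies. The function name `stream_chunks` and the
`update` callback are hypothetical; in the real module the callback's
role would presumably be played by blake3::Hasher::update, which is
left out here so the example has no external dependencies.

```rust
use std::io::{self, Read};

/// Buffer size matching the 64 KiB read buffer named in the hash module.
const BUF_SIZE: usize = 64 * 1024;

/// Feed `reader` to `update` one chunk at a time, never holding more than
/// `BUF_SIZE` bytes of the input in memory. Returns the total byte count.
fn stream_chunks<R: Read>(
    mut reader: R,
    mut update: impl FnMut(&[u8]),
) -> io::Result<u64> {
    let mut buf = vec![0u8; BUF_SIZE];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            // Retry reads that were interrupted by a signal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        update(&buf[..n]);
        total += n as u64;
    }
    Ok(total)
}
```

Passing a closure keeps the loop independent of the digest type, so the
same sketch works with any incremental hasher.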
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>