feat(fb-36): search filter args (--media / --ingested-after / --doc-id + 4 existing) #127
Summary

- `kebab search` and the equivalent inputs on `mcp__kebab__search`: `SearchFilters` fields exposed:
  - Existing (4): `--tag` (repeatable, OR-within), `--lang`, `--path-glob`, `--trust-min`
  - New (3): `--media` (csv, `md` alias), `--ingested-after` (RFC3339 UTC), `--doc-id`
- Filters combine with AND; multiple values are OR-matched within `--tag` and `--media`
- SQL WHERE clause extension for lexical (media uses `CASE WHEN json_type='text'` to handle both unit and tuple `MediaType` serde shapes), over-fetch + `filter_chunks` post-filter for vector (shared `kebab-store-sqlite::filter_chunks` helper)
- Wire schemas `search_response.v1` and `search_hit.v1` untouched
- Invalid `--ingested-after` / unknown `trust_min` → `error.v1.code = config_invalid` (CLI) / `invalid_input` (MCP); unknown `--media` value → empty hits, no error
- `--trust-min` accepts `primary | secondary | generated` (matches actual `kebab_core::TrustLevel` variants)

Test plan
- `cargo test --workspace --no-fail-fast -j 1`: green
- `cargo clippy --workspace --all-targets -- -D warnings`: clean
- `--media md` (alias), `--ingested-after garbage` (config_invalid + exit 2), `--ingested-after 2030-01-01T00:00:00Z` (no hits), `--doc-id <id>` (scope), `--tag rust` (frontmatter)

Architectural notes
- The 3 new `SearchFilters` fields are additive with `#[serde(default)]`: old JSON without the new keys deserializes cleanly.
- `MediaType` JSON has two shapes (`"markdown"` for unit variants, `{"image":"png"}` for tuple variants); the SQL `CASE WHEN json_type='text' THEN json_extract($) ELSE (first object key) END` extracts a unified kind string.
- The vector post-filter goes through `kebab-store-sqlite::SqliteStore::filter_chunks`, which gained the same 3 WHERE arms: a single source of truth for the filter SQL across lexical's SQL builder and vector's post-filter.
- `path_glob` remains a Rust post-filter, unchanged from before fb-36.
- `--trust-min`: plan/spec drafts mentioned `Trusted` / `Reviewed` / `Hearsay` / `Untrusted`; the actual `kebab_core::TrustLevel` variants are `Primary` / `Secondary` / `Generated`. Implementation and docs use the real variants.

Files of interest
- `docs/superpowers/specs/2026-05-10-p9-fb-36-search-filters-design.md`
- `docs/superpowers/plans/2026-05-10-p9-fb-36-search-filters.md`
- `crates/kebab-core/src/search.rs` (`SearchFilters`)
- `crates/kebab-search/src/lexical.rs` + `kebab-store-sqlite/src/filters.rs` (shared `filter_chunks`)
- `crates/kebab-cli/src/main.rs` (`Cmd::Search`)
- `crates/kebab-mcp/src/tools/search.rs` (`SearchInput`)

SQL WHERE clause extension. media uses `CASE WHEN json_type='text'` to handle both unit (`"markdown"`) and tuple (`{"image":"png"}`) `MediaType` serde shapes. ingested_after relies on RFC3339 lexicographic ordering with UTC `Z` (per fb-32 ingest invariant). doc_id is a simple equality. AND combinator with existing tags / lang / trust filters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fb-36 round 1: overall solid. `SearchFilters` extension, `filter_chunks` extension, lexical SQL extension, CLI/MCP dispatch, and tests all line up with the spec, and `cargo test -p kebab-store-sqlite filter_chunks` + `-p kebab-search --test lexical` + `-p kebab-cli --test wire_search_filters` are all green locally. Lexical and vector go through identical SQL fragments for the 3 new arms (subquery `IN (SELECT d2.doc_id FROM documents d2 JOIN assets a ...)` for media, `d.updated_at >= ?` for ingested_after, `d.doc_id = ?` for doc_id), so behavior parity is preserved. Wire schema unchanged, MCP `SearchInput` extended additively, and `#[serde(default)]` on the new `SearchFilters` fields keeps backwards compatibility.

Three changes requested before merge:
1. (Important) README documentation drift on `--tag`

The README.md `kebab search` row says: "`--tag` / `--media` each take `,`-separated multiple values, OR-matched". This is wrong about `--tag`. The CLI declares `--tag` as repeatable but without `value_delimiter = ','`; only `--media` is csv. The spec correctly says `--tag rust --tag async` (repeatable). Either:

- reword the README to "`--tag` is repeatable (`--tag a --tag b`), `--media` is `,`-separated", or
- add `value_delimiter = ','` to `tag` and update the spec.

The first is the smaller / spec-compliant fix.
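
To make the difference between the two flag styles concrete, here is a minimal std-only Rust model of how argument values end up in the tag list. It is a sketch, not clap's implementation: `collect_repeatable`, `collect_csv`, and `matches_any` are hypothetical helper names introduced only for illustration, and the OR-within matching is assumed to be a plain "any requested tag is present" check.

```rust
/// Repeatable flag (assumed model of `--tag`): every occurrence contributes
/// exactly one value, taken verbatim, so "rust,async" stays a single tag.
fn collect_repeatable(values: &[&str]) -> Vec<String> {
    values.iter().map(|v| v.to_string()).collect()
}

/// Comma-delimited flag (assumed model of `--media`): each occurrence is
/// split on ',' and may contribute several values.
fn collect_csv(values: &[&str]) -> Vec<String> {
    values
        .iter()
        .flat_map(|v| v.split(','))
        .map(str::to_string)
        .collect()
}

/// OR-within semantics: a document matches if it carries at least one of
/// the requested tags.
fn matches_any(doc_tags: &[&str], wanted: &[String]) -> bool {
    wanted.iter().any(|w| doc_tags.contains(&w.as_str()))
}
```

Under this model, `--tag rust --tag async` yields two tags and matches docs tagged `rust` or `async`, while `--tag rust,async` yields the single literal tag `rust,async` and matches neither. Switching `--tag` to csv later would change that second case without any error being raised.
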
2. (Important) `--ingested-after` accepts non-UTC offsets but compares lexicographically

Spec §Filter layer says "ingested_after: `d.updated_at >= ?` (RFC3339 lexicographic compare; assumes UTC `Z`)". The CLI flag doc also says "(UTC)". But the implementation in `crates/kebab-cli/src/main.rs:686-712` and `crates/kebab-mcp/src/tools/search.rs:74-89` parses any RFC3339, including offsets like `+09:00`, and `crates/kebab-search/src/lexical.rs:352-358` / `crates/kebab-store-sqlite/src/filters.rs:159-165` then format the `OffsetDateTime` with its original offset, e.g. `2026-04-01T00:00:00+09:00`. A lexicographic `>=` against DB values that always store `Z` is wrong: the string compare orders by the local wall-clock digits (and, on a tie, by the suffix byte, where ASCII `'+'` (0x2B) < `'Z'` (0x5A)), not by the instant. The bound `2026-04-01T00:00:00+09:00` is really `2026-03-31T15:00:00Z`, 9 hours earlier than its digits suggest, so a doc stored at e.g. `2026-03-31T20:00:00Z` fails `d.updated_at >= ?` and the filter silently misses recent docs.

Pick one:

- Convert the `OffsetDateTime` to UTC before formatting in `crates/kebab-search/src/lexical.rs` and `crates/kebab-store-sqlite/src/filters.rs`.
- Reject non-UTC offsets with `config_invalid` / `invalid_input`.

The first preserves the documented "any RFC3339 accepted" surface and just fixes the comparison. Since both retrievers format the timestamp themselves, both call sites need the change.
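
The mismatch between string order and instant order can be reproduced without a database. The sketch below is std-only Rust and deliberately does not use the `time` crate; `lex_ge`, `to_unix_seconds`, `instant_ge`, and `days_from_civil` are hypothetical helpers written for this illustration, and the parser only handles the fixed `YYYY-MM-DDTHH:MM:SS` + `Z` / `±HH:MM` shape (no fractional seconds, no validation of field ranges). It assumes the stored column is compared byte-wise, as the spec's "lexicographic compare" wording suggests.

```rust
use std::ops::Range;

/// Byte-wise string comparison, standing in for `d.updated_at >= ?` on TEXT.
fn lex_ge(stored: &str, bound: &str) -> bool {
    stored >= bound
}

/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the well-known "days from civil" algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400;
    let mp = (m + 9) % 12; // March = 0 ... February = 11
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parses one numeric field out of the timestamp.
fn field(ts: &str, r: Range<usize>) -> Option<i64> {
    ts.get(r)?.parse().ok()
}

/// Converts a fixed-shape RFC3339 timestamp to seconds since the Unix epoch.
fn to_unix_seconds(ts: &str) -> Option<i64> {
    let (y, mo, d) = (field(ts, 0..4)?, field(ts, 5..7)?, field(ts, 8..10)?);
    let (h, mi, s) = (field(ts, 11..13)?, field(ts, 14..16)?, field(ts, 17..19)?);
    let offset_secs = match ts.get(19..)? {
        "Z" => 0,
        off if off.len() == 6 => {
            let sign = match off.get(..1)? {
                "+" => 1,
                "-" => -1,
                _ => return None,
            };
            sign * (field(off, 1..3)? * 3600 + field(off, 4..6)? * 60)
        }
        _ => return None,
    };
    // Local wall-clock time minus the offset gives the UTC instant.
    Some(days_from_civil(y, mo, d) * 86_400 + h * 3600 + mi * 60 + s - offset_secs)
}

/// Instant-based comparison: what the filter is meant to compute.
fn instant_ge(stored: &str, bound: &str) -> Option<bool> {
    Some(to_unix_seconds(stored)? >= to_unix_seconds(bound)?)
}
```

With a doc stored at `2026-04-01T01:00:00Z` and a bound of `2026-04-01T05:00:00+09:00`, `lex_ge` returns `false` (the doc is dropped) while `instant_ge` returns `Some(true)`. Once the bound is rendered in UTC as `2026-03-31T20:00:00Z`, `lex_ge` agrees with the instant comparison, which is why normalizing to UTC at the formatting step is enough to keep the SQL unchanged.
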
3. (Important, test gap) Multi-value `--tag` not exercised

Spec test plan line 172: "`kebab search Q --tag rust --tag async --json` IN-list behavior". `crates/kebab-cli/tests/wire_search_filters.rs::search_with_tag_filter_matches_frontmatter_tags` only uses a single `--tag rust`. Add a case that ingests three docs (tag=rust, tag=async, tag=other), runs `--tag rust --tag async`, and asserts that both tagged docs surface but the third does not. This protects the `tags_any` IN-list behavior from regression (especially since the README's CSV claim makes it tempting to add `value_delimiter = ','` later, which would silently change semantics for `--tag rust,async`).

Minor (nice-to-have, not blocking)
- `crates/kebab-mcp/src/tools/search.rs:164` writes the string literal `"error.v1"`. It matches `kebab_app::ERROR_V1_ID`, but using the constant (as `crates/kebab-cli/src/main.rs:697` does) is more consistent with the rest of the codebase and survives a future schema-id rename.
- The lexical media arm uses `f.doc_id IN (SELECT d2.doc_id FROM documents d2 JOIN assets a ...)` even though the outer query already joins `documents d`. A simpler `JOIN assets a ON a.asset_id = d.asset_id` + `WHERE CASE ... END IN (...)` would match the existing tag/lang/trust pattern. The vector path can keep the subquery since its outer query is different. Skip if you prefer the current shape; vector and lexical SQL would diverge slightly but both are already idempotent.

Round 2 review: all round 1 fixes verified, no follow-ons.
Round 1 verification (commit `84287d0`):

- `--ingested-after` non-UTC offset: `to_offset(time::UtcOffset::UTC)` applied at both SQL-formatting sites (`crates/kebab-search/src/lexical.rs:357`, `crates/kebab-store-sqlite/src/filters.rs:164`). Comments updated to explain the lex-compare bug (`+` 0x2B < `Z` 0x5A in ASCII).
- README: `--tag` rewritten as repeatable-only (only `--media` is csv).
- New test `search_with_two_tag_filters_returns_or_within_tags` in `wire_search_filters.rs:233`. Asserts a.md (rust) ∪ b.md (async), excludes c.md (no tags).
- `ERROR_V1_ID` const used in `kebab-mcp/src/tools/search.rs:166`.
- `filter_chunks_ingested_after_non_utc_offset_compares_as_instant` (`filters.rs:675`) seeds a doc at `2026-04-01T01:00:00Z` and queries `2026-04-01T05:00:00+09:00` (== `2026-03-31T20:00:00Z`); the doc must match. Tight, surgical regression coverage.

Round 2 follow-on probing, nothing actionable found:
- `to_offset(UTC)` produces the canonical `Z` form regardless of source offset sign. The `time` crate's RFC3339 well-known formatter consistently outputs `...Z` for UTC instants. No new edge case.
- The timestamp flows as `OffsetDateTime` (preserving offset) → `SearchFilters` → SQL formatter. The fix is at the SQL formatter, which is the right layer: keep `OffsetDateTime` semantically rich until the storage boundary. Confirmed by grep across the CLI / MCP / search / store crates.
- The spec's `Risks / notes` doesn't actually call out an "assumes UTC `Z`" rule; only the in-code comments did, and they've been updated. The behavior fix lands in the same PR before merge, so it's not a post-merge deviation.
- `lexical_filter_by_ingested_after` covers UTC, and the `filter_chunks` regression test covers the structural bug. Adding a duplicate at the lexical-integration layer is low marginal value, so the current coverage is acceptable.
- `updated_at` vs `ingested_after` naming: ingest commits update `updated_at`, so the column semantically tracks last-ingest time. Fine.

Verification:
- `cargo clippy --workspace --all-targets -- -D warnings` → clean
- `cargo test --workspace --no-fail-fast -j 1` → all green, no FAIL lines

LGTM. Ready to merge.