Duplicate of code-rust-ast-v1 with language-agnostic body unchanged. Cross-chunker
policy_hash identity asserted vs md-heading-v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rust path observably unchanged (verified by existing code_ingest_smoke tests).
Python/TS/JS arms bail with TODO; per-lang extractor + chunker land in subsequent tasks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`runner_lexical_is_deterministic_per_query_payload` 가 full-suite 첫 실행에서
간헐적으로 `elapsed_ms: 0` vs `elapsed_ms: 1` 차이로 깨지는 timing flake 가
있었음 (PR #140 회차 0 의 full-suite 실행에서 관찰).
원인: per_query 전체 JSON 을 byte-identical 비교하는데 QueryResult.elapsed_ms
가 timing 기반이라 µs-scale wall-clock jitter 가 그대로 비교에 들어감. 의도는
"timing 외에 byte-identical" — 인접 snapshot test #7 은 projection 으로
timing 을 명시적으로 제외하지만 #6 은 누락.
Fix: 비교 직전 양쪽 run 의 elapsed_ms 를 0 으로 normalize. 의도 그대로
표현하고 다른 field 의 결정성 검증은 보존. 50회 반복 stress 통과 (이전:
간헐 실패).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #140 회차 2 actionable 4건:
- README.md: `citation.kind = "code"` 행에서 wire 필드 구조 정정 — citation 안에는 `lang`, SearchHit top-level 에는 `code_lang`/`repo` (round 1 SMOKE 정정과 동일 클래스)
- docs/SMOKE.md: 격리 config 블록에 `extra_skip_globs = []` 추가 (P10 섹션의 "위 격리 config 블록 참조" 와 정합)
- crates/kebab-parse-code/src/rust.rs: comment-only 파일 → 0 blocks 동작을 module doc 에 한 줄 명시 (pdf-page-v1 의 "empty page produces no chunks" 패턴과 동일)
- .gitignore: `/target/` 제거 — `/target` (no trailing slash) 이 디렉토리 + 파일 + 심링크 모두 매칭하므로 `/target/` (dir 전용) 는 redundant
verify: `cargo check -p kebab-parse-code` clean (주석/문서 외 영향 없음).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #140 회차 1 actionable 7건 반영:
- docs/SMOKE.md: parser_version "code-rust-ast-v1" → "code-rust-v1" (chunker_version 과 혼동); jq path .citation.code_lang → .citation.lang (wire 의 code_lang 은 SearchHit top-level)
- docs/ARCHITECTURE.md: Mermaid pcode→ptypes 잘못된 edge → pcode→core 로 정정 (kebab-parse-code Cargo.toml 실제 dep 와 일치); 디렉토리 트리에서 code-rust-ast-v1 chunker 표기 위치 kebab-parse-code → kebab-chunk 로 정정
- crates/kebab-app/src/app.rs: backfill_repo 의 .ok().flatten() 실패 silent swallow → tracing::warn 로 관측 가능, 비-abort 의도 보존
- crates/kebab-parse-code/src/rust.rs: impl_item arm 의 "function_item 만 unit 생성" 1A scope 한정 주석을 외부에서도 보이도록 arm 상단에 한 줄 추가 (내부 주석은 유지)
verify: kebab-parse-code 7/7 / kebab-app --lib 51/51 / code_ingest_smoke 3/3 green; touched-crate clippy clean (재부팅 전 검증).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `SqliteStore::code_lang_breakdown()` that queries
`json_extract(metadata_json, '$.code_lang')`, groups by it, and
skips NULL rows — returning `BTreeMap<String, u32>`.
Wire it into `collect_stats` in `kebab-app::schema`, replacing the
`BTreeMap::new()` placeholder inserted by 1A-1.
Test: `store::tests::code_lang_breakdown_counts_by_code_lang` asserts
rust=1 and that a null-code_lang doc does NOT appear in the map.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `MediaType::Code("rust")` dispatch arm in `ingest_one_asset`,
`ingest_one_code_asset` fn (faithful mirror of `ingest_one_pdf_asset`),
and `backfill_code_lang` post-processing in `App::search_uncached`.
Integration test `code_ingest_smoke.rs` verifies full pipeline:
ingest `.rs` → Citation::Code hit with lang/symbol/line_start.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TDD: red → green cycle confirmed. New `Code(String)` variant serializes
as `{"code":"rust"}` via serde `rename_all = "lowercase"`. All exhaustive
`match` sites updated (`media_label`, `ingest_one_asset` catch-all →
explicit or-pattern). Design §3.5 enum listing synced. Also fix
`/target` symlink gitignore pattern so integration-test binary lookup
via workspace-relative path works with CARGO_TARGET_DIR redirect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
회차 1 review 의 4 건 actionable 모두 반영:
1. frozen design §2.1 의 code variant 예시에서 존재하지 않는 `repo` 필드 제거 + nested form 에서 actual wire (flat) 형태로 정리. 5 variant 의 nested-form illustrative example 은 그대로 두고, code variant 만 별도 block 으로 분리해서 actual wire 와 1:1 매칭. 또 위쪽 6 variant nested-form group 에서도 'code' 행 삭제 (정확한 contract 는 별도 block 에 있음).
2. §2.2 SearchHit 예시의 `repo: null, code_lang: null` + 'omitted when null' 주석 모순 제거 — 키 자체를 빼고 inline 주석으로 'markdown hit 에는 absent, 코드 hit 에서만 surface' 설명.
3. HANDOFF Phase row 식별자 `**10**` → `**P10**` (다른 row 와 일관성).
4. README synopsis 의 중복 `[--media code]` 제거 (`--media` 는 이미 위쪽에 한 번 있음, code 는 값 중 하나라 prose 에서 설명).
코드 변경 없음 — 모두 markdown 문서.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task 16's new code_lang_breakdown / repo_breakdown fields broke the existing schema_wrapper_tags_schema_version test in wire.rs which constructs Stats { ... } literally. Use ..Default::default() since Stats now derives Default.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task 13: add wire regression tests proving markdown SearchHit omits
repo/code_lang when None, and all 5 original Citation variants serialize
byte-identically without spurious Code-variant keys.
Task 15: add --repo (repeatable) and --code-lang (repeatable,
comma-separated) flags to `kebab search`; propagate both into
SearchFilters instead of the previous vec![] stub. Add
#[allow(clippy::large_enum_variant)] — Cmd is short-lived, boxing buys
nothing.
Task 16: add code_lang_breakdown and repo_breakdown BTreeMap fields to
Stats (schema.v1); derive Default on Stats; populate both as empty in
collect_stats (1A-2 fills them when code chunks land). Add unit test
asserting both keys are present in the serialized object.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire kebab_parse_code::is_generated_file and is_oversized into
FsSourceConnector::scan_with_skips. Files that pass gitignore/builtin/
kebabignore matching are now checked for generated-file markers
(config-gated via ingest.code.skip_generated_header) and byte/line caps
(ingest.code.max_file_bytes / max_file_lines). FsScanSkips gains
skipped_generated + skipped_size_exceeded counters; kebab-app threads
them into IngestReport. Also fixes a pre-existing clippy::derivable_impls
warning in IngestCfg. Three new connector tests cover all three paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add IngestCfg + IngestCodeCfg structs with serde defaults and embed
ingest: IngestCfg into the top-level Config. Existing configs without
an [ingest] section continue to load unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Refactor walker to expose WalkOverrides (combined + per-source matchers),
add walk_files_with_skips that returns accepted files alongside skip
attribution, wire FsSourceConnector::scan_with_skips into kebab-app so
IngestReport.skipped_gitignore, skipped_kebabignore, skipped_builtin_blacklist,
and skip_examples are populated instead of left at zero. Priority order
per spec §5.2 (builtin > gitignore > kebabignore) enforced in classify_skip,
with a directory-aware builtin matcher so pruned directory entries are
correctly attributed to builtin rather than a coincident gitignore entry.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevent double-`!` corruption when a `.gitignore` negation pattern
(e.g. `!keep/`) hits the trailing-slash normalizer in `read_gitignore`.
Also updates module-level and `build_overrides` doc to list all five
filter sources in application order, and adds a regression test.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds read_gitignore() (pub(crate), root-only, nested cascade deferred)
and merges its patterns as a 5th group in build_overrides(). Trailing-
slash patterns (dist/) are normalized to also emit a stem/** glob so
files inside the directory are matched when is_dir=false. Two new tests
cover both the happy path and the missing-file no-op.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wires `kebab_parse_code::BUILTIN_BLACKLIST` (6 patterns: node_modules,
target, __pycache__, .venv, venv, env) into `build_overrides()` so the
walker automatically excludes these directories even when the user has
no `.kebabignore`. TDD cycle: 2 failing tests added first, then the
pattern-add loop inserted after the existing kbignore block.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tasks 5-8: new `kebab-parse-code` crate with three infrastructure modules
for the code ingest framework. Ships lang.rs (extension→language identifier
mapping), repo.rs (.git walk-up via gix 0.70 for RepoMeta), and skip.rs
(BUILTIN_BLACKLIST, is_generated_file, is_oversized). 14 integration tests
across three test files, all passing; clippy -D warnings clean.
Note: gix pinned to 0.70 (not 0.83 as originally suggested) because 0.83
fails to compile against Rust 1.94.1 due to non-exhaustive match patterns
in gix-hash. 0.70 resolves cleanly and has identical head_name/head_id API.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four optional, serde-skipped-when-None fields added to `Metadata` for
code ingest context. All 11 downstream construction sites patched with
`repo: None, git_branch: None, git_commit: None, code_lang: None`.
Full workspace check (`--tests`) and per-crate test suite pass clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds five new u32 counters (skipped_gitignore, skipped_kebabignore,
skipped_builtin_blacklist, skipped_generated, skipped_size_exceeded)
and a SkipExamples struct (≤5 sample paths per category) to
IngestReport. All new fields are #[serde(default)] for backward-compat
deserialization. Downstream literal construction sites patched with
zeros/empty; snapshot re-baked.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire two new optional fields onto SearchHit (skip_serializing_if = None)
and two Vec<String> filter fields onto SearchFilters (serde default).
Add RetrievalDetail::Default impl (manual, uses SearchMode::Hybrid as
sentinel). Patch all downstream SearchHit / SearchFilters literal
constructors with repo: None / code_lang: None / vec![] as appropriate.
Also covers Citation::Code arm in kebab-eval metrics match.
1A 가 들고 들어가는 *프레임워크 surface* (Citation `code` variant, SearchHit repo/code_lang, --media code / --code-lang / --repo filter, skip 정책, IngestReport 세분화, config 절, kebab-parse-code crate skeleton) 가 *언어 chunker 자체* 와 독립 검증 가능 — 1A-1 머지 후 기존 markdown corpus 의 wire 출력이 byte-level identical 한지 regression test 로 검증한 다음 1A-2 에서 Rust AST chunker 자체에 집중. binary version bump 트리거도 1A-2 로 미룸 (1A-1 은 wire additive minor + 사용자 surface 변경 없음).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>