58ac62d53a
feat(search): provenance source filter - [[workspace.sources]] multi-source + --source/--source-type
...
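A provenance lever for mixed-source KBs (wiki + jira, etc.): index everything,
then narrow by source at query time. Global multiplicative trust weighting
(weighted-RRF) was refuted in A/B testing (θ=0.85 alone drops incident MRR
0.918→0.340, a cliff caused by score compression); filtering is the right lever,
with no see-saw.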
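The score-compression effect can be illustrated with plain arithmetic. The sketch below is not the A/B harness: it assumes the common RRF form 1/(k + rank) with k = 60 (the commit does not state the constant), a single ranked list, and a hypothetical helper `rank_after_penalty`. It shows why a multiplicative weight moves a document far down the list when adjacent RRF scores differ by only a few percent.

```rust
/// Reciprocal-rank-fusion contribution of a 1-based rank.
/// `k` is the smoothing constant; 60.0 is an assumption, not taken from the commit.
pub fn rrf(rank: usize, k: f64) -> f64 {
    1.0 / (k + rank as f64)
}

/// Hypothetical helper: the 1-based position that a document originally at `rank`
/// lands on after its score is multiplied by `theta`, while every other
/// document in ranks 1..=n keeps weight 1.0.
pub fn rank_after_penalty(rank: usize, theta: f64, k: f64, n: usize) -> usize {
    let penalized = theta * rrf(rank, k);
    // Count the documents whose unpenalized score now beats the penalized one.
    let ahead = (1..=n)
        .filter(|&r| r != rank && rrf(r, k) > penalized)
        .count();
    ahead + 1
}
```

Under these assumptions `rank_after_penalty(1, 0.85, 60.0, 100)` returns 11: a top hit from a down-weighted source falls behind ten unpenalized documents, which is consistent with the MRR cliff reported above.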
- config [[workspace.sources]] (each with id/root/exclude/trust_level/source_type);
a single root is normalized to an implicit `default` source. validate: id must be unique and non-empty.
- config schema v3→v4 (step_3_to_4, mirrors root into [[workspace.sources]] id=default, idempotent)
- V014 documents.source_id column + index (additive, DEFAULT 'default', no re-indexing needed)
- Metadata.source_id + BodyHints trust precedence (frontmatter > source default > Primary)
- ingest: when --root is not given, iterate resolved_sources() and stamp source_id/trust on each doc
- search: SearchFilters.source_type/source_id → both the lexical and vector sites (IN, OR)
- CLI: kebab search --source <id> / --source-type <type> (repeatable/comma-separated)
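A minimal sketch of the filter semantics listed above, under stated assumptions: the struct is a stand-in that reuses the `SearchFilters` field names from the commit, `parse_flag_values` and `matches` are hypothetical helpers, and "(IN, OR)" is read as "any-of within a field, OR across the two fields". The actual implementation applies the filter in the lexical and vector queries and may combine the fields differently.

```rust
/// Stand-in for the filter fields named in the commit (layout is assumed).
#[derive(Default, Debug, Clone, PartialEq)]
pub struct SearchFilters {
    pub source_id: Vec<String>,
    pub source_type: Vec<String>,
}

/// Flatten repeatable, comma-separated flag values,
/// e.g. `--source wiki,jira --source notes`.
pub fn parse_flag_values(raw: &[&str]) -> Vec<String> {
    raw.iter()
        .flat_map(|v| v.split(','))
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

impl SearchFilters {
    /// Assumed semantics: no filter means everything matches; otherwise a doc
    /// matches if its id is in `source_id` OR its type is in `source_type`.
    pub fn matches(&self, doc_source_id: &str, doc_source_type: &str) -> bool {
        if self.source_id.is_empty() && self.source_type.is_empty() {
            return true;
        }
        let by_id = self.source_id.iter().any(|s| s == doc_source_id);
        let by_type = self.source_type.iter().any(|s| s == doc_source_type);
        by_id || by_type
    }
}
```

With `source_id` parsed from `["wiki,jira", "notes"]`, a document stamped `wiki` matches and one stamped `default` does not; an empty filter keeps the pre-v0.29.0 behavior of searching everything.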
Dogfooding (620 docs, jira 400 + wiki 220): --source wiki raises concept-query MRR 0.780→0.810,
--source jira raises incident MRR 0.918→0.975. Trust precedence verified in practice (jira=secondary default).
version bump 0.28.0 → 0.29.0 (new CLI flags + config keys + V014 migration → minor).
follow-up: MCP search does not expose the filters · kebab list does not show source_id · RAG provenance labels.
Details: tasks/HOTFIXES.md (2026-06-21), docs/release-notes/v0.29.0-draft.md.
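The trust precedence noted in this commit (frontmatter > source default > Primary) reduces to a two-step fallback. The sketch below assumes a two-variant `TrustLevel` enum and a hypothetical `resolve_trust` function; the real type may carry more levels.

```rust
/// Assumed trust levels; only `Primary` and `secondary` are named in the commit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrustLevel {
    Primary,
    Secondary,
}

/// Hypothetical resolver: a frontmatter value wins, then the source's
/// configured default, and `Primary` is the final fallback.
pub fn resolve_trust(
    frontmatter: Option<TrustLevel>,
    source_default: Option<TrustLevel>,
) -> TrustLevel {
    frontmatter.or(source_default).unwrap_or(TrustLevel::Primary)
}
```

A jira document without frontmatter therefore inherits the source default (`Secondary` in the dogfooding setup), while an explicit frontmatter value overrides it.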
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012Mc6W1fgsrbFKTsqA6P8La
2026-06-21 08:35:19 +00:00
d5c69f6715
refactor(config): v3 path call-site sweep (kebab-app/kebab-eval/kebab-parse-image)
...
Insert .ingest into the parent path (leaf structs unchanged). Covers all src + test call sites.
The v2 TOML fixtures in the kebab-cli tests are kept to verify the from_file auto-conversion (T6) path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 12:40:06 +00:00
375a0693e4
chore(ocr): T11/T12 — clippy clean + docs + v0.27.0 bump
...
T11: fix 12 clippy lints in paddle_onnx.rs/paddle_e2e.rs (doc overindent,
finish_non_exhaustive, map_or_else, RangeInclusive::contains, cast_lossless,
is_some_and, usize::from). Full-workspace clippy -D warnings = 0.
Smoke (paddle-onnx, real binary): clean_paragraph OCR verbatim-correct, real
per-region confidence (0.99/0.96/0.95), FTS5 lexical hit on Korean(검색)+
English(embedding), parser_version folds |ocr:1:paddle-onnx:<ver>. Big page
<4s inference (5.6s ingest incl. one-time session load).
T12: README [image.ocr].engine + ARCHITECTURE OCR row + SMOKE paddle-onnx config
+ HANDOFF + HOTFIXES dated entry. Workspace version 0.26.2 → 0.27.0 (minor:
new engine value + config keys). .gitattributes: onnx as plain blobs (no git-lfs).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 08:36:19 +00:00
901416d8e9
feat(ocr): T7-T9 — config overrides + engine factory + signature cascade
...
T7: OcrCfg gains det_model/rec_model/dict overrides + score_thresh/
unclip_ratio/max_boxes (serde default, KEBAB_IMAGE_OCR_* env). OnnxPaddleOcr::new
threads them via ModelPaths::from_config.
T8: build_image_ocr_engine / build_pdf_ocr_engine factories return
Box<dyn OcrEngine>; match on engine string (ollama-vision|paddle-onnx|err).
ImagePipeline.ocr_engine + pdf_ocr_engine signatures switched to &dyn OcrEngine.
OcrEngine gains model() for the progress label.
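The factory shape described in T8 can be sketched as follows. The trait and struct names come from the commit, but every signature, field, and error type here is an assumption made for illustration (the real factories presumably take the OCR config rather than two strings).

```rust
/// Minimal stand-in for the trait; only `model()` is mentioned in the commit.
pub trait OcrEngine {
    /// Name shown in the progress label.
    fn model(&self) -> &str;
}

/// Stand-in engines; fields are hypothetical.
pub struct OllamaVisionOcr {
    model: String,
}
pub struct OnnxPaddleOcr;

impl OcrEngine for OllamaVisionOcr {
    fn model(&self) -> &str {
        &self.model
    }
}
impl OcrEngine for OnnxPaddleOcr {
    fn model(&self) -> &str {
        "paddle-onnx"
    }
}

/// Match on the engine string and return a boxed trait object;
/// unknown values are rejected instead of silently falling back.
pub fn build_image_ocr_engine(engine: &str, model: &str) -> Result<Box<dyn OcrEngine>, String> {
    match engine {
        "ollama-vision" => Ok(Box::new(OllamaVisionOcr { model: model.to_string() })),
        "paddle-onnx" => Ok(Box::new(OnnxPaddleOcr)),
        other => Err(format!("unknown OCR engine: {other}")),
    }
}

/// Callers only need `&dyn OcrEngine`, matching the switched pipeline signatures.
pub fn progress_label(engine: &dyn OcrEngine) -> String {
    format!("ocr[{}]", engine.model())
}
```

Taking `&dyn OcrEngine` at the call sites keeps the pipeline independent of which backend the factory selected.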
T9: ingest_config_signature image/pdf branches emit |ocr:1:{engine}:{engine_version}
(memoized blake3 per asset-triple, m3-safe). Unit tests (a)(b)(c) added.
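A rough sketch of the T9 signature cascade, assuming the engine version is derived once per (det, rec, dict) asset triple and then cached. `VersionCache` and `ocr_signature_suffix` are hypothetical names, and the hashing itself is passed in as a closure because the commit's blake3 computation is not reproduced here.

```rust
use std::collections::HashMap;

/// (det model, rec model, dict) paths; the key layout is an assumption.
pub type AssetTriple = (String, String, String);

/// Memoizes the engine version so the model files are hashed once per triple.
pub struct VersionCache {
    cache: HashMap<AssetTriple, String>,
}

impl VersionCache {
    pub fn new() -> Self {
        Self { cache: HashMap::new() }
    }

    /// `compute` stands in for the real hash; it only runs on a cache miss.
    pub fn get_or_compute<F>(&mut self, triple: &AssetTriple, compute: F) -> String
    where
        F: FnOnce(&AssetTriple) -> String,
    {
        if let Some(v) = self.cache.get(triple) {
            return v.clone();
        }
        let v = compute(triple);
        self.cache.insert(triple.clone(), v.clone());
        v
    }
}

/// Suffix format quoted in the commit: |ocr:1:{engine}:{engine_version}
pub fn ocr_signature_suffix(engine: &str, engine_version: &str) -> String {
    format!("|ocr:1:{engine}:{engine_version}")
}
```

Because the suffix is part of the ingest signature, swapping the engine or its model files changes the signature and marks the affected documents for re-ingest.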
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 08:15:30 +00:00
685007789a
style: cargo fmt --all (round 4 ingest log feature follow-up)
...
The Phase C4 executor's final `fix(test): clippy + fmt fixes` commit applied fmt
only to the test files. Found that the workspace-wide fmt had been missed → applied
cargo fmt --all. All imports reordered alphabetically + line wrapping made consistent.
Additional untracked artifacts committed at the same time:
- docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 lines, ACCEPT)
- docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 lines, ACCEPT)
workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:18:40 +00:00
241ded59df
test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)
...
v0.20.0 sub-item 1 bugfix Step 3 (Group C) — integration-level regression
for Bug #3 (intra-doc chunk_id collision under aggressive overlap).
- `crates/kebab-app/tests/common/mod.rs`: `pub mod mock_ocr;` 1 line append.
- `crates/kebab-app/tests/common/mock_ocr.rs` (new): MockOcrEngine lift +
`single` / `per_page` ctor (backward-compat single + per-page cursor).
- `crates/kebab-app/tests/pdf_ocr_apply.rs`: remove the inline MockOcrEngine +
`mod common; use common::mock_ocr::MockOcrEngine;` import. 10 ctor call
site migration (`MockOcrEngine { .. }` → `MockOcrEngine::single(...)`).
- `crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs`
(new): F1 + F2 scanned PDF + Bug #3 trigger shape (10 char "가" + ". " +
500 char "나") via mock OCR. assertion: chunk_id global uniqueness (HashSet
dedup) across F1 + F2; F2 trigger text produces ≥2 chunks (collision shape).
- C1 decision: Option A (share via tests/common/mock_ocr.rs). Facade mock
injection unavailable (OllamaVisionOcr hardcoded) — helper-level chain test
(apply_ocr_to_pdf_pages → PdfPageV1Chunker) adds value beyond unit B5.
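The shared mock and the uniqueness assertion described above might look roughly like this. Only the `MockOcrEngine`, `single`, and `per_page` names come from the commit; the fields, the `recognize` method, and the `all_unique` helper are assumptions, and the real mock implements the project's OCR trait rather than a free-standing method.

```rust
use std::cell::Cell;
use std::collections::HashSet;

/// Test double: either repeats one text for every page, or walks a per-page list.
pub struct MockOcrEngine {
    pages: Vec<String>,
    cursor: Cell<usize>,
    repeat: bool,
}

impl MockOcrEngine {
    /// Backward-compatible constructor: the same text for every call.
    pub fn single(text: &str) -> Self {
        Self { pages: vec![text.to_string()], cursor: Cell::new(0), repeat: true }
    }

    /// One text per call, consumed in order via an internal cursor.
    pub fn per_page(pages: Vec<String>) -> Self {
        Self { pages, cursor: Cell::new(0), repeat: false }
    }

    /// Hypothetical entry point; returns an empty string once pages run out.
    pub fn recognize(&self) -> String {
        if self.repeat {
            return self.pages[0].clone();
        }
        let i = self.cursor.get();
        self.cursor.set(i + 1);
        self.pages.get(i).cloned().unwrap_or_default()
    }
}

/// HashSet-based dedup check: true only if no chunk_id occurs twice.
pub fn all_unique(ids: &[String]) -> bool {
    let mut seen = HashSet::new();
    ids.iter().all(|id| seen.insert(id))
}
```

In the integration test the ids passed to such a check would be the chunk_ids collected across both scanned PDFs (F1 + F2), so a collision in either file, or between them, fails the assertion.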
spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§4.5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 3)
prior: 436fd01 (Step 2 Bug #3 chunk_id fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:45:38 +00:00
th-kim0823
af80cedd81
feat(app): App::search_with_opts + SearchResponse (fb-34)
...
Budget loop: snippet shorten → k pop → ≥1 hit floor. Cursor
encode/decode threads corpus_revision; mismatch surfaces as
stale_cursor anyhow error. App::search retained as thin wrapper.
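The budget loop (snippet shorten → k pop → ≥1 hit floor) can be sketched as below. The `Hit` struct, the `fit_to_budget` name, and the use of character counts as the budget unit are all assumptions; the commit does not say how the budget is measured.

```rust
/// Hypothetical search hit; only the snippet counts toward the budget here.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub id: String,
    pub snippet: String,
}

/// Total snippet size in characters (assumed budget unit).
fn cost(hits: &[Hit]) -> usize {
    hits.iter().map(|h| h.snippet.chars().count()).sum()
}

/// Fit a ranked hit list into `budget`:
/// 1. shorten every snippet to `min_snippet` characters,
/// 2. then drop the lowest-ranked hits,
/// 3. but never return fewer than one hit.
pub fn fit_to_budget(mut hits: Vec<Hit>, budget: usize, min_snippet: usize) -> Vec<Hit> {
    if cost(&hits) > budget {
        for h in &mut hits {
            h.snippet = h.snippet.chars().take(min_snippet).collect();
        }
    }
    while cost(&hits) > budget && hits.len() > 1 {
        hits.pop();
    }
    hits
}
```

Shortening before popping preserves recall for as long as possible; the one-hit floor means a very small budget can still be exceeded by the single remaining hit.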
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:59:48 +09:00
th-kim0823
dfef65f196
feat(app): staleness module + post-process search hits (fb-32)
...
compute_stale: strict > boundary, threshold=0 disables, future
timestamps treated as fresh (clock skew safety). App::search
re-stamps on cache hit so config threshold changes take effect
without flushing the cache.
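The three rules for compute_stale stated above translate into a short function. This is a sketch with an assumed signature (`SystemTime` inputs and a `Duration` threshold); the real function may take the threshold in days or another unit.

```rust
use std::time::{Duration, SystemTime};

/// Assumed signature; implements the rules from the commit message:
/// - threshold == 0 disables staleness,
/// - a hit is stale only when its age is strictly greater than the threshold,
/// - an `indexed_at` in the future (clock skew) is treated as fresh.
pub fn compute_stale(indexed_at: SystemTime, now: SystemTime, threshold: Duration) -> bool {
    if threshold.is_zero() {
        return false;
    }
    match now.duration_since(indexed_at) {
        Ok(age) => age > threshold,
        // `duration_since` errors when `indexed_at` is later than `now`.
        Err(_) => false,
    }
}
```

Because the function is pure, re-running it over cached hits is cheap, which is what lets a changed threshold take effect without flushing the cache.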
Also unblocks the workspace build by plugging placeholder
indexed_at/stale into the two AnswerCitation construction
sites in kebab-rag/pipeline.rs (the score-gate refusal path
forwards from SearchHit; the LLM-citation path uses
UNIX_EPOCH/false until Task 7 wires the real values through
pack_context).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:30:10 +09:00
ef5d0770ae
review(p9-fb-25-task1): fix kebab-app test references to removed WorkspaceCfg.include
...
reviewer-flagged: task 1 missed test files using cfg.workspace.include.
- crates/kebab-app/tests/common/mod.rs: SourceScope literal switched
to ..Default::default().
- crates/kebab-app/tests/image_pipeline.rs (×3): drop dead-no-op
cfg.workspace.include.push(...) calls; comment explains removal.
- crates/kebab-app/tests/pdf_pipeline.rs: same treatment.
Pre-fb-25 these pushes were no-ops (include was dead config field
not enforced anywhere). Removal is purely mechanical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:53:19 +00:00
3f0b00439a
review(p9-fb-10-task5): promote lexical_query to common + tighten Korean hit assertion
...
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-03 10:14:17 +00:00
911fb49550
refactor(rename): kb crates → kebab — Cargo packages, folders, Rust modules
...
First step of renaming the project from `kb` to `kebab`.
- workspace `Cargo.toml`: members `crates/kb-*` → `crates/kebab-*`,
repository URL `altair823/kb` → `altair823/kebab`.
- Rename the 18 crate folders via `git mv` (history preserved).
- Each crate's `Cargo.toml`: `name = "kb-*"` → `"kebab-*"`, path deps
`../kb-*` → `../kebab-*`.
- All `.rs` files: the 18 snake-case module paths `kb_<id>` (`kb_core`,
`kb_config`, `kb_app`, `kb_cli`, `kb_eval`, `kb_search`, `kb_chunk`,
`kb_normalize`, `kb_source_fs`, `kb_parse_md`, `kb_parse_types`,
`kb_store_sqlite`, `kb_store_vector`, `kb_embed`, `kb_embed_local`,
`kb_llm`, `kb_llm_local`, `kb_rag`) → `kebab_<id>` via a bulk sed (using the
word boundary \\b so the "kb" abbreviation inside English sentences is not touched).
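The effect of the word-boundary rule can be shown without sed. The function below is an illustrative re-implementation, not the command that was run: `rename_kb_paths` is a hypothetical name, and it rewrites any `kb_` prefix at a word boundary instead of matching the 18 module names individually.

```rust
/// Replace `kb_` with `kebab_` only where `kb_` starts a word, so that
/// a bare "kb" in prose and identifiers like `mykb_core` stay untouched.
pub fn rename_kb_paths(src: &str) -> String {
    let is_word = |c: char| c.is_alphanumeric() || c == '_';
    let mut out = String::with_capacity(src.len() + 16);
    let mut prev: Option<char> = None;
    let mut rest = src;
    while let Some(c) = rest.chars().next() {
        // A boundary exists at the start of input or after a non-word character.
        let at_boundary = prev.map_or(true, |p| !is_word(p));
        if at_boundary && rest.starts_with("kb_") {
            out.push_str("kebab_");
            rest = &rest[3..];
            prev = Some('_');
        } else {
            out.push(c);
            rest = &rest[c.len_utf8()..];
            prev = Some(c);
        }
    }
    out
}
```

For example, `use kb_core::Doc; // size in kb` becomes `use kebab_core::Doc; // size in kb`: the module path is rewritten while the "kb" in the comment is left alone.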
The CLI binary name (`[[bin]] name = "kb"`), the `KB_*` environment variables, XDG paths,
tracing targets, and the docs sweep follow in the next commit.
## Verification
- `cargo check --workspace` clean: committed after every crate built successfully.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 03:28:08 +00:00