Commit Graph

7 Commits

Author SHA1 Message Date
b9ee09f176 feat(app): wire PDF OCR enrichment + cancel propagation into ingest_one_pdf_asset (H-5 eager init + post-extract hook + per-page cancel) + workspace lopdf dep (Step 4 M-4)
Step 6 (Group E) of v0.20.0 sub-item 1 (scanned PDF OCR) plan +
Step 7 spillover (IngestEvent variant + IngestItem field for compile
boundary) + Step 4 reviewer Minor M-4 fix.

E1 — eager PDF OCR engine build at `ingest_with_config_opts` entry,
mirror of image OCR pattern (lib.rs:338-347). `pdf.ocr.enabled ||
always_on` 시 `OllamaVisionOcr::from_parts(endpoint, model, ...)` 호출
+ fail-fast `?`. App field 추가 0 (local var only, spec L-1 / Step 1
A1 cosmetic fix 정합).

E2 — `ingest_one_pdf_asset` signature extension: +3 param
(`pdf_ocr_engine: Option<&OllamaVisionOcr>`, `progress: Option<&
mpsc::Sender<IngestEvent>>`, `cancel: Option<&Arc<AtomicBool>>`).
`ingest_one_asset` dispatch wrapper + caller (dispatch loop) update.

E3 — post-extract enrichment block at `extract_for` 직후 (line 1779).
`pdf.ocr.enabled || always_on` 시 `apply_ocr_to_pdf_pages` 호출,
PdfOcrProgress → IngestEvent emit (PdfOcrStarted / PdfOcrFinished
with ocr_engine), summary 의 pages_ocrd/ms_total 을 IngestItem field
로 carry. PR #187 registry dispatch invariant 보존
(`extract_for(&asset.media_type, ...)` 그대로).

E4 — cancel handle propagation: ingest_with_config_cancellable →
IngestOpts.cancel → ingest_with_config_opts → ingest_one_asset →
ingest_one_pdf_asset (new `cancel` param) → PdfOcrOpts.cancel chain.
spec §4.8 line 1159 production wiring.

Step 7 spillover (compile boundary):
- `kebab_app::ingest_progress::IngestEvent`: PdfOcrStarted { page } +
  PdfOcrFinished { page, ms, chars, ocr_engine }. serde discriminant
  `pdf_ocr_started` / `pdf_ocr_finished` (Step 7 G3 wire schema 와 일치).
- `kebab_core::IngestItem`: pdf_ocr_pages: Option<u32> +
  pdf_ocr_ms_total: Option<u64> (warnings/error 사이). 11 non-PDF
  IngestItem construct site 가 `None` 채움.
- `kebab-cli/src/progress.rs` + `kebab-tui/src/ingest_progress.rs`:
  새 variant no-op handler (v1에서 per-page progress 미노출, future
  refinement 시 활성화 가능).
- `kebab-store-sqlite/tests/ingest_report_snapshot.rs` + snapshot
  `ingest_report.snapshot.json`: 2 IngestItem fixture 의 새 field 추가.
- Step 7 의 JSON Schema 갱신 + CLI printer activation + snapshot
  regenerate 는 별 commit (G3/H1/H2 deliverable).

M-4 (Step 4 reviewer Minor) — lopdf workspace dep 통합:
- workspace `Cargo.toml [workspace.dependencies] lopdf = "0.32"`.
- kebab-app + kebab-parse-pdf 의 direct dep → `{ workspace = true }`.

Verifier evidence:
- workspace test (`cargo test --workspace --no-fail-fast -j 1`):
  175 test result summary lines, 0 failures, 0 FAILED.
- workspace clippy (`-D warnings`): exit 0, 0 warning.
- dep graph baseline (`.omc/state/pdf-ocr-{parse-pdf,app-parse}-deps.baseline.txt`):
  empty diff for both.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.4 + §4.6 + §4.8)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 6 E1-E4 + Step 7 partial G1+G2)
prior: 4672cba (Step 5 fix) + fd918a6 (Step 5) + 9f003ef (Step 4 helper)
contract: §9 (additive minor wire bump — Step 7 JSON Schema 완료 시)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 08:18:34 +00:00
27baec82ea fix(dogfood): auto-purge stored docs for filesystem-deleted files
Files deleted from disk (rm a.md) were leaving stale documents + chunks +
embeddings in the store, surfacing as ghost citations in search/ask.
Existing purge_orphan_at_workspace_path only handled content-changed
stale (WHERE workspace_path=? AND asset_id != ?) — file deletion has no
new asset_id.

Fix: post-walker-scan sweep. Compute (stored_paths - scanned_paths),
for each candidate check filesystem existence — only purge when the
file is TRULY missing. Scope-narrowing case (file on disk but outside
include glob) is explicitly NOT purged to protect users from accidental
data loss via config edits.

Adds:
- DocumentStore::all_workspace_paths trait method + SqliteStore impl
- purge_deleted_workspace_path in store-sqlite (returns chunk_ids for
  vector delete; deletes doc CASCADE + asset row + copied storage file)
- sweep_deleted_files in kebab-app::ingest path; called once per ingest
  before the per-asset loop
- IngestReport.purged_deleted_files counter (additive, serde default)
- CLI ingest summary mentions purge count when > 0
- 2 integration tests: file_deletion_auto_purge + include_scope_narrowing_does_NOT_purge

dogfood discovery (PR #142 1B + multi-root: kebab-docs + httpx + zod
+ lodash). Per user decision: only filesystem deletion auto-purges;
scope narrowing requires explicit kebab reset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 06:51:07 +00:00
th-kim0823
351c7a0826 feat(p10-1a-1): add IngestReport skip counters + SkipExamples
Adds five new u32 counters (skipped_gitignore, skipped_kebabignore,
skipped_builtin_blacklist, skipped_generated, skipped_size_exceeded)
and a SkipExamples struct (≤5 sample paths per category) to
IngestReport. All new fields are #[serde(default)] for backward-compat
deserialization. Downstream literal construction sites patched with
zeros/empty; snapshot re-baked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 15:28:19 +09:00
693f5582f0 feat(kebab-core, kebab-app): p9-fb-25 task 4 — IngestReport.skipped_by_extension + wire schema additive
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:06:34 +00:00
0684b3ad66 review(p9-fb-23-task1): fix missed IngestReport construction sites + snapshot
reviewer-flagged: aa2a6ea claimed build clean but missed:
- crates/kebab-store-sqlite/tests/ingest_report_snapshot.rs (test fixture)
- crates/kebab-cli/src/wire.rs (test fixture)
- crates/kebab-store-sqlite/snapshots/ingest_report.snapshot.json (snapshot)

All three add `unchanged: 0` (or `\"unchanged\": 0`) to match the new
IngestReport.unchanged field. cargo clippy --workspace --all-targets
-- -D warnings now clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:47:13 +00:00
7a49c8a29b feat(kebab-normalize): p9-fb-07 markdown title fallback chain
`kebab-normalize::derive_title(frontmatter_title, blocks, file_stem)` 가
다음 단계로 비어있지 않은 첫 결과를 사용:

1. frontmatter `title` (trim 후)
2. 첫 H1 텍스트
3. 첫 H2 텍스트
4. 첫 Paragraph (Quote / List / Code / Table / ImageRef 제외) 의 첫 80 자
5. 파일 stem (확장자 제외)
6. (sentinel) `"untitled"` — 위 다섯 단계가 모두 blank 인 병적 케이스

선택된 문자열은 NFC 정규화. 빈 문자열은 절대 반환하지 않음.

`build_canonical_document` 가 metadata lift 직후 helper 호출. 기존 단순
lift 로직 (metadata.user["title"] → CanonicalDocument.title) 은 fallback
chain 의 1 단계 입력으로 자리 이동.

`KEBAB_PARSE_MD_VERSION` 상수를 `pulldown-cmark-0.x` → `md-frontmatter-v2`
로 bump. parser_version 변경 → §4.2 doc_id 입력 변화 → 기존 markdown
doc 의 `doc_id` 갱신, 다음 ingest 시 idempotent upsert 로 자동 재처리
(design §9 cascade). `kebab-store-sqlite` 의 snapshot fixture 도 같은
literal 로 갱신.

기존 M7 정책 ("metadata.user[\"title\"] = '' 가 빈 title 로 lift") 은
폐기. 빈 문자열 입력은 fallback chain 을 타고 file stem 까지 떨어진다.
spec p9-fb-07 line 37: "빈 문자열 반환 금지".

테스트 (kebab-normalize):
- 8 개 단위 테스트 (각 fallback 단계 + NFC + sentinel)
- `build_canonical_document` 통합 테스트 2 개 (H1 / file stem)
- 기존 M7 테스트 2 개를 새 정책에 맞춰 갱신

문서:
- README: `kebab ingest` 행에 "title 자동 채움" 안내 + 기존 doc 도
  다음 ingest 에서 갱신
- HANDOFF: 2026-05-03 머지 후 발견 entry
- spec status: `planned` → `in_progress`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:22:34 +00:00
911fb49550 refactor(rename): kb crates → kebab — Cargo packages, folders, Rust modules
프로젝트 이름 `kb` → `kebab` rename 의 첫 단계.

- workspace `Cargo.toml`: members `crates/kb-*` → `crates/kebab-*`,
  repository URL `altair823/kb` → `altair823/kebab`.
- 18 crate 폴더 rename via `git mv` (history 보존).
- 각 crate `Cargo.toml`: `name = "kb-*"` → `"kebab-*"`, path deps
  `../kb-*` → `../kebab-*`.
- 모든 `.rs`: `kb_<id>` snake-case 모듈 path 18 개 (`kb_core`,
  `kb_config`, `kb_app`, `kb_cli`, `kb_eval`, `kb_search`, `kb_chunk`,
  `kb_normalize`, `kb_source_fs`, `kb_parse_md`, `kb_parse_types`,
  `kb_store_sqlite`, `kb_store_vector`, `kb_embed`, `kb_embed_local`,
  `kb_llm`, `kb_llm_local`, `kb_rag`) → `kebab_<id>` 일괄 sed (단어
  경계 \\b 사용해 영어 문장 안의 "kb" 약어 미오염).

CLI binary 이름 (`[[bin]] name = "kb"`), 환경변수 `KB_*`, XDG paths,
tracing target, 그리고 docs sweep 은 다음 commit 에서.

## 검증

- `cargo check --workspace` clean — 모든 crate 빌드 통과 후 commit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 03:28:08 +00:00