kebab

Author	SHA1	Message	Date
altair823	48197687b7	test(pdf): integration smoke (w/ search + cancel) + vector regression + alnum e2e (#[ignore]) for v0.20 sub-item 1 Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan. I3 — crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (신규): - ingest_with_mock_ocr_yields_pdf_ocr_summary — `#[ignore]` real Ollama, ingest_with_config production path + IngestItem.pdf_ocr_pages verify. - ocr_text_indexed_and_searchable — `#[ignore]` real Ollama, app.search 의 OCR text indexed verify (§ Acceptance #2). - ingest_with_cancel_aborts_mid_pdf — production cancel chain (pre-set cancel=true + dummy endpoint, no panic/deadlock verify). I4 — crates/kebab-parse-pdf/tests/text_extractor_regression.rs (신규): - vector_pdf_extract_byte_identical_to_baseline — F4 mojibake.pdf 의 vector PDF path canonical 의 byte-identical 보존 (Step 1-8 모든 변경 전후 invariant). - baseline 신규 = tests/snapshots/vector_pdf_canonical.json (first run create). - normalize_provenance_timestamps inline helper (R-3 mitigation, workspace 전체 부재 — 신규 12-line). I5 — crates/kebab-parse-pdf/tests/ocr_e2e.rs (신규): - f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70 — `#[ignore]` real Ollama qwen2.5vl:3b, § Acceptance §9 #3 의 implementation. - alnum metric = strsim::levenshtein (dev-dep 추가). - truth file copy from PoC scratch (page1.txt + page2-batchim.txt) → scanned_page1_truth.txt + scanned_page2_truth.txt. - kebab-parse-image dev-dep 추가 (OllamaVisionOcr::from_parts 호출용). parser isolation invariant 의 dev-dep exception (spec §3.1, dep graph baseline -e normal 보존). spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5) prior: `c9e0594` (Step 8 CLI printer) contract: §9 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 10:10:58 +00:00
altair823	b9ee09f176	feat(app): wire PDF OCR enrichment + cancel propagation into ingest_one_pdf_asset (H-5 eager init + post-extract hook + per-page cancel) + workspace lopdf dep (Step 4 M-4) Step 6 (Group E) of v0.20.0 sub-item 1 (scanned PDF OCR) plan + Step 7 spillover (IngestEvent variant + IngestItem field for compile boundary) + Step 4 reviewer Minor M-4 fix. E1 — eager PDF OCR engine build at `ingest_with_config_opts` entry, mirror of image OCR pattern (lib.rs:338-347). `pdf.ocr.enabled \|\| always_on` 시 `OllamaVisionOcr::from_parts(endpoint, model, ...)` 호출 + fail-fast `?`. App field 추가 0 (local var only, spec L-1 / Step 1 A1 cosmetic fix 정합). E2 — `ingest_one_pdf_asset` signature extension: +3 param (`pdf_ocr_engine: Option<&OllamaVisionOcr>`, `progress: Option<& mpsc::Sender<IngestEvent>>`, `cancel: Option<&Arc<AtomicBool>>`). `ingest_one_asset` dispatch wrapper + caller (dispatch loop) update. E3 — post-extract enrichment block at `extract_for` 직후 (line 1779). `pdf.ocr.enabled \|\| always_on` 시 `apply_ocr_to_pdf_pages` 호출, PdfOcrProgress → IngestEvent emit (PdfOcrStarted / PdfOcrFinished with ocr_engine), summary 의 pages_ocrd/ms_total 을 IngestItem field 로 carry. PR #187 registry dispatch invariant 보존 (`extract_for(&asset.media_type, ...)` 그대로). E4 — cancel handle propagation: ingest_with_config_cancellable → IngestOpts.cancel → ingest_with_config_opts → ingest_one_asset → ingest_one_pdf_asset (new `cancel` param) → PdfOcrOpts.cancel chain. spec §4.8 line 1159 production wiring. Step 7 spillover (compile boundary): - `kebab_app::ingest_progress::IngestEvent`: PdfOcrStarted { page } + PdfOcrFinished { page, ms, chars, ocr_engine }. serde discriminant `pdf_ocr_started` / `pdf_ocr_finished` (Step 7 G3 wire schema 와 일치). - `kebab_core::IngestItem`: pdf_ocr_pages: Option<u32> + pdf_ocr_ms_total: Option<u64> (warnings/error 사이). 11 non-PDF IngestItem construct site 가 `None` 채움. - `kebab-cli/src/progress.rs` + `kebab-tui/src/ingest_progress.rs`: 새 variant no-op handler (v1에서 per-page progress 미노출, future refinement 시 활성화 가능). - `kebab-store-sqlite/tests/ingest_report_snapshot.rs` + snapshot `ingest_report.snapshot.json`: 2 IngestItem fixture 의 새 field 추가. - Step 7 의 JSON Schema 갱신 + CLI printer activation + snapshot regenerate 는 별 commit (G3/H1/H2 deliverable). M-4 (Step 4 reviewer Minor) — lopdf workspace dep 통합: - workspace `Cargo.toml [workspace.dependencies] lopdf = "0.32"`. - kebab-app + kebab-parse-pdf 의 direct dep → `{ workspace = true }`. Verifier evidence: - workspace test (`cargo test --workspace --no-fail-fast -j 1`): 175 test result summary lines, 0 failures, 0 FAILED. - workspace clippy (`-D warnings`): exit 0, 0 warning. - dep graph baseline (`.omc/state/pdf-ocr-{parse-pdf,app-parse}-deps.baseline.txt`): empty diff for both. spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.4 + §4.6 + §4.8) plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 6 E1-E4 + Step 7 partial G1+G2) prior: `4672cba` (Step 5 fix) + `fd918a6` (Step 5) + `9f003ef` (Step 4 helper) contract: §9 (additive minor wire bump — Step 7 JSON Schema 완료 시) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 08:18:34 +00:00
altair823	c2cd3a7ab7	feat(parse-pdf): add page_image (DCTDecode passthrough, 2 test) + text_quality (valid char ratio, 8 unit test) modules Step 3 (Group C) of v0.20.0 sub-item 1 (scanned PDF OCR) plan. C1 — `page_image::extract_dctdecode_page_image(pdf_doc, page_num)` -> Result<Option<Vec<u8>>>. lopdf 의 Resources/XObject traverse, 첫 image XObject 의 /Filter 검사 (single Name OR Array form 모두 cover, spec §4.1 line 642-664), DCTDecode + JPEG magic 검증 통과 시 raw bytes 반환. 다른 encoding 또는 image XObject 부재 시 Ok(None). v1 scope = DCTDecode passthrough only (H-3 invariant, image crate 도입 0). Integration test (`tests/page_image.rs`, 2 test): - f1_fixture_yields_dctdecode_jpeg_bytes — F1 fixture happy path. - flate_raw_fixture_yields_none — F6 fixture negative path. C2 — `text_quality::compute_valid_char_ratio(s) -> f32`. valid char = ASCII printable + Hangul (Jamo/Compatibility/Syllables) + CJK + Latin Extended + common Korean punctuation. 빈 string → 0.0. caller (`kebab-app::pdf_ocr_apply`) 가 threshold 와 비교 (default 0.5). Unit test (`mod tests`, 7 + F4 conditional): - empty / pure ASCII / pure Hangul / pure PUA / mixed half / CJK / Hangul Jamo. - f4_fixture_ratio_under_threshold: active (case A — lopdf extract_text 가 ToUnicode CMap 부재 시 빈 string 반환 → valid_ratio = 0.0000 < 0.3). Also: Cargo.toml description 갱신 ("Text PDF extractor + scanned-page image extract helpers ...", Step 1 A2 이연분). fixture fix: mojibake.pdf 의 startxref 22130 → 22114 (16-byte offset 오차 수정 — lopdf strict parser 가 xref 를 찾지 못하는 버그 해결). spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 line 600-722) plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 3 C1+C2) prior: `aeeff36` (Step 2 fixtures) + `fb3952d` (Step 2 F7 record fix) contract: §9 (additive minor wire bump — 후속 step) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 05:59:10 +00:00
altair823	7c85de065a	chore: workspace-wide cleanup — clippy::pedantic baseline + auto-fix cut PR v0.18.0 전 마지막 정리. 사용자 요청: "전체 코드베이스를 깔끔하고 알아보기 쉽게". ## Workspace lints - `Cargo.toml` 의 `[workspace.lints.clippy]` 에 `pedantic = "warn"` (priority -1) + 의도적 allow-list 추가: - cast_possible_truncation / cast_possible_wrap / cast_sign_loss / cast_precision_loss — ONNX i64 / hash modular reduction 등 의도적 truncation. - doc_markdown / missing_errors_doc / missing_panics_doc — cosmetic doc style. - too_many_lines / module_name_repetitions / must_use_candidate / needless_pass_by_value / manual_let_else / items_after_statements / similar_names — informational only. - format_collect / match_wildcard_for_single_variants / trivially_copy_pass_by_ref / unnecessary_wraps — intentional patterns (exhaustive match, future Result variants 등). - default_trait_access — `Foo::default()` 가 idiomatic. - float_cmp — NLI / RRF score 의 explicit threshold 비교 의도. - struct_excessive_bools / case_sensitive_file_extension_comparisons / naive_bytecount / ignore_without_reason — domain-specific 의도. - format_push_string / return_self_not_must_use / match_same_arms — builder / wire-label / hot-path 패턴 보존. - needless_continue / used_underscore_binding / nonminimal_bool / unreadable_literal / many_single_char_names / doc_link_with_quotes / assigning_clones / collapsible_str_replace / trivial_regex / elidable_lifetime_names / range_plus_one / explicit_iter_loop / implicit_hasher / ref_option — remaining low-value style. - 각 24 crate `Cargo.toml` 에 `[lints] workspace = true` 추가. ## Auto-fix `cargo clippy --workspace --all-targets --fix` 적용 — 128 files changed, 552 insertions / 472 deletions. 주로: - uninlined_format_args (~18): `format!("{}", x)` → `format!("{x}")`. - redundant_closure_for_method_calls (~33): `.map(\|x\| x.foo())` → `.map(T::foo)`. - 그 외 mechanical refactor. ## 검증 - `cargo clippy --workspace --all-targets -j 1 -- -D warnings` clean (pedantic + 모든 lint group). - `cargo test --workspace --no-fail-fast -j 1` — 1293 tests pass + 1 pre-existing flaky fail (`kebab-mcp::tools_call_ask_multi_hop::ask_tool_routes_multi_hop_true_to_decompose_first`, HOTFIX candidate, cleanup 무관). 회귀 0. Wire 영향: 없음. Behavior 영향: 없음 (mechanical refactor only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 03:01:58 +00:00
altair823	8de08cf38c	review(p7-1): 회차 1 지적 반영 - Cargo.toml: 사용하지 않는 deps 제거 (`kebab-config`, `thiserror`, `pdf-extract`, dev `tempfile` / `serde_json` / `serde`). 특히 `pdf-extract` 가 끌어오던 transitive ~150 crate (pom, postscript, type1-encoding-parser, adobe-cmap-parser, euclid, chrono, md5, linked-hash-map …) 가 모두 사라짐. lopdf 만 남음. - info.rs: BOM 없는 PDFDocEncoded Title 디코드 버그 수정. `from_utf8_lossy` 는 0x80–0xFF 를 U+FFFD 로 치환해 "Café" 같은 레거시 타이틀을 망가뜨림. byte → `char` 직접 캐스팅 (Latin-1 디코더) 로 교체. 회귀 테스트 `info_dict_title_pdfdocencoding_latin1_high_bytes_decoded` 추가. - info.rs: 모듈 doc 의 "Latin-1 superset" 부정확 표현 정정 — PDFDocEncoding 은 0x18–0x1F / 0x80–0x9F 영역에서 Latin-1 과 다름. - lib.rs: `saturating_sub(1)` 가 page=0 케이스를 silent 흡수하던 부분에 `debug_assert!` 추가. release 는 saturating fallback 유지 (panic 보다 garbled order 가 운영에 유리). - tests: UTF-16 surrogate pair 커버리지 갭 보완 — 🥙 (U+1F959) 가 포함된 타이틀로 `String::from_utf16_lossy` 의 페어-결합 경로 검증. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 08:40:40 +00:00
altair823	5a158d7343	feat(kebab-parse-pdf): P7-1 text PDF extractor — per-page CanonicalDocument `PdfTextExtractor`(MediaType::Pdf) lopdf 기반 per-page 텍스트 추출. 페이지마다 `Block::Paragraph` + `SourceSpan::Page { page, char_start, char_end }` emit. 본문이 비거나 추출 panic 인 페이지는 빈 paragraph + `Provenance::Warning` ("scanned candidate") 로 표시 — 이후 OCR fallback (별도 task) 의 입력. 핵심 동작: - `lopdf::Document::load_mem` + `is_encrypted()` → 암호화 PDF 는 명시 에러 (`qpdf --decrypt` 안내). - 페이지 단위 `extract_text(&[page])` 를 `catch_unwind` 로 감싸 malformed page panic 을 recoverable warning 으로 변환. - `/Info` dict 에서 Title/Producer/Creator best-effort 추출. UTF-16BE BOM prefixed 문자열도 디코드 (한국어 등 non-ASCII Title 정상 처리). - 9개 통합 테스트: 3-page emit, scanned-mixed warning, encrypted refuse, corrupt header error, page_count 메타, UTF-16BE Title, filename fallback, determinism, snapshot. `parser_version = "pdf-text-v1"`. Allowed deps: `lopdf 0.32` + `pdf-extract 0.7` (원본 spec 그대로). 본문 다국어 OCR fallback 은 §9.2 후속 task (out of scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 08:34:55 +00:00

6 Commits