kebab

Author	SHA1	Message	Date
altair823	685007789a	style: cargo fmt --all (round 4 ingest log feature follow-up) The Phase C4 executor's last commit, `fix(test): clippy + fmt fixes`, applied fmt only to the test files. The workspace-wide fmt pass was found to be missing → applied cargo fmt --all. All imports are reordered alphabetically and line wrapping is made consistent. Untracked artifacts committed at the same time: - docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 lines, ACCEPT) - docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 lines, ACCEPT) workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 04:18:40 +00:00
altair823	e674ff474b	fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4) Bug #4 from the v0.20.0 sub-item 1 dogfood report: lopdf `get_pages()` returns count = 0 for F4 mojibake.pdf (Pages tree broken). root cause = the previous byte-level `re.sub` + manual startxref edit lets the file pass lopdf's strict load but breaks the `/Kids` reference in the Pages dict. - `tests/fixtures/_synth/mojibake.py`: full rewrite, replacing byte-level `re.sub` + manual startxref with pikepdf open+inject-dummy-ToUnicode+del+save (auto xref regen). HYSMyeongJo-Medium CID font: the CID font does not generate a ToUnicode stream on its own, so a dummy stream is injected and then stripped (removed=1 invariant). Exit codes 2/3/4 for invariant fail. - `crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf`: regenerated via pikepdf: 1 valid page, no /ToUnicode marker, byte-identical and reproducible thereafter. - `crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json`: regen via 2-run cargo test pattern (hand-rolled unwrap_or_else baseline bootstrap, no insta crate). - `crates/kebab-parse-pdf/tests/text_extractor_regression.rs`: append 3 invariant tests: (1) lopdf 1-page, (2) /ToUnicode marker absent, (3) PdfTextExtractor 1-block invariant. - `crates/kebab-parse-pdf/src/text_quality.rs`: f4_fixture_ratio_under_threshold threshold 0.3 → 0.5 (the production valid_ratio_threshold default). The old broken fixture (pages=0) gave extract_text="" → ratio=0.0; the new fixed fixture goes through the CID 2-byte fallback decode → ratio≈0.375, which still satisfies the OCR trigger condition. spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§5) plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 4) prior: `241ded5` (Step 3 integration test) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 14:02:17 +00:00
altair823	48197687b7	test(pdf): integration smoke (w/ search + cancel) + vector regression + alnum e2e (#[ignore]) for v0.20 sub-item 1 Step 9 (Group I) of v0.20.0 sub-item 1 (scanned PDF OCR) plan. I3: crates/kebab-app/tests/ingest_pdf_ocr_smoke.rs (new): - ingest_with_mock_ocr_yields_pdf_ocr_summary: `#[ignore]` real Ollama, verifies the ingest_with_config production path + IngestItem.pdf_ocr_pages. - ocr_text_indexed_and_searchable: `#[ignore]` real Ollama, verifies via app.search that OCR text is indexed (§ Acceptance #2). - ingest_with_cancel_aborts_mid_pdf: production cancel chain (pre-set cancel=true + dummy endpoint, verifies no panic/deadlock). I4: crates/kebab-parse-pdf/tests/text_extractor_regression.rs (new): - vector_pdf_extract_byte_identical_to_baseline: the canonical output of the vector PDF path for F4 mojibake.pdf stays byte-identical (invariant before and after all Step 1-8 changes). - new baseline = tests/snapshots/vector_pdf_canonical.json (created on first run). - normalize_provenance_timestamps inline helper (R-3 mitigation; no such helper exists anywhere in the workspace, so a new 12-line one is added). I5: crates/kebab-parse-pdf/tests/ocr_e2e.rs (new): - f1_alnum_accuracy_ge_85 / f2_alnum_accuracy_ge_70: `#[ignore]` real Ollama qwen2.5vl:3b, implementation of § Acceptance §9 #3. - alnum metric = strsim::levenshtein (dev-dep added). - truth files copied from PoC scratch (page1.txt + page2-batchim.txt) → scanned_page1_truth.txt + scanned_page2_truth.txt. - kebab-parse-image dev-dep added (for calling OllamaVisionOcr::from_parts). This is a dev-dep exception to the parser isolation invariant (spec §3.1; the `-e normal` dep graph baseline is preserved). spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 9 I3+I4+I5) prior: `c9e0594` (Step 8 CLI printer) contract: §9 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 10:10:58 +00:00
altair823	c2cd3a7ab7	feat(parse-pdf): add page_image (DCTDecode passthrough, 2 tests) + text_quality (valid char ratio, 8 unit tests) modules Step 3 (Group C) of v0.20.0 sub-item 1 (scanned PDF OCR) plan. C1: `page_image::extract_dctdecode_page_image(pdf_doc, page_num)` -> Result<Option<Vec<u8>>>. Traverses lopdf's Resources/XObject, inspects the /Filter of the first image XObject (covers both the single Name and the Array form, spec §4.1 line 642-664), and returns the raw bytes when the DCTDecode + JPEG magic checks pass. Returns Ok(None) for any other encoding or when no image XObject exists. v1 scope = DCTDecode passthrough only (H-3 invariant, the image crate is not introduced). Integration tests (`tests/page_image.rs`, 2 tests): - f1_fixture_yields_dctdecode_jpeg_bytes: F1 fixture happy path. - flate_raw_fixture_yields_none: F6 fixture negative path. C2: `text_quality::compute_valid_char_ratio(s) -> f32`. valid char = ASCII printable + Hangul (Jamo/Compatibility/Syllables) + CJK + Latin Extended + common Korean punctuation. Empty string → 0.0. The caller (`kebab-app::pdf_ocr_apply`) compares the ratio against the threshold (default 0.5). Unit tests (`mod tests`, 7 + F4 conditional): - empty / pure ASCII / pure Hangul / pure PUA / mixed half / CJK / Hangul Jamo. - f4_fixture_ratio_under_threshold: active (case A: lopdf extract_text returns an empty string when the ToUnicode CMap is absent → valid_ratio = 0.0000 < 0.3). Also: Cargo.toml description updated ("Text PDF extractor + scanned-page image extract helpers ...", deferred from Step 1 A2). fixture fix: mojibake.pdf startxref 22130 → 22114 (corrects a 16-byte offset error, resolving the bug where the lopdf strict parser could not find the xref). spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 line 600-722) plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 3 C1+C2) prior: `aeeff36` (Step 2 fixtures) + `fb3952d` (Step 2 F7 record fix) contract: §9 (additive minor wire bump, in a follow-up step) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 05:59:10 +00:00
altair823	aeeff3635b	poc+test(pdf-ocr): lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7) for v0.20 sub-item 1 Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan. B1: lopdf /Filter probe (Python re + shell grep on synthesized fixtures, result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md). Key findings: - reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ]; useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0. - Pillow RGB.save('.pdf','PDF') uses DCTDecode, so F6 FlateDecode requires manual PDF construction via zlib.compress. - ghostscript pdfwrite rejects TIFF input (/undefined in II*), so ImageMagick `convert -compress Group4` is used for F7 CCITTFax. B2: 5 fixtures synthesized and committed under crates/kebab-parse-pdf/tests/fixtures/: - F1 scanned_page1.pdf: /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, Korean). - F2 scanned_page2.pdf: /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, batchim). - F4 mojibake.pdf: DejaVu TTF + ToUnicode CMap stripped (count=0); Noto CJK TTC has PostScript outlines unsupported by reportlab. - F6 flate_raw.pdf: /Filter /FlateDecode, DCTDecode absent (skip path input). - F7 ccitt.pdf: /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input). Synth scripts under tests/fixtures/_synth/: - scanned_pdf.py: F1/F2 reportlab drawImage + JPEG passthrough (useA85=0). - mojibake.py: F4 reportlab DejaVu TTF + ToUnicode strip. - flate_ccittfax.sh: F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert. spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1) plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2) contract: §9 (additive minor wire bump, in a follow-up step) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 04:04:47 +00:00
altair823	8de08cf38c	review(p7-1): address round 1 review feedback - Cargo.toml: remove unused deps (`kebab-config`, `thiserror`, `pdf-extract`, dev `tempfile` / `serde_json` / `serde`). In particular, the ~150 transitive crates pulled in by `pdf-extract` (pom, postscript, type1-encoding-parser, adobe-cmap-parser, euclid, chrono, md5, linked-hash-map …) are all gone. Only lopdf remains. - info.rs: fix the decode bug for BOM-less PDFDocEncoded Titles. `from_utf8_lossy` replaces 0x80–0xFF bytes with U+FFFD, which mangles legacy titles such as "Café". Replaced with a direct byte → `char` cast (Latin-1 decoder). Added regression test `info_dict_title_pdfdocencoding_latin1_high_bytes_decoded`. - info.rs: correct the inaccurate "Latin-1 superset" wording in the module doc: PDFDocEncoding differs from Latin-1 in the 0x18–0x1F / 0x80–0x9F ranges. - lib.rs: add a `debug_assert!` where `saturating_sub(1)` was silently absorbing the page=0 case. Release builds keep the saturating fallback (a garbled order is operationally preferable to a panic). - tests: close the UTF-16 surrogate pair coverage gap by verifying the pair-combining path of `String::from_utf16_lossy` with a title containing 🥙 (U+1F959). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 08:40:40 +00:00
altair823	5a158d7343	feat(kebab-parse-pdf): P7-1 text PDF extractor, per-page CanonicalDocument `PdfTextExtractor` (MediaType::Pdf): lopdf-based per-page text extraction. Emits a `Block::Paragraph` + `SourceSpan::Page { page, char_start, char_end }` for each page. Pages whose body is empty or whose extraction panics are marked with an empty paragraph + `Provenance::Warning` ("scanned candidate"), which becomes the input for a later OCR fallback (separate task). Key behavior: - `lopdf::Document::load_mem` + `is_encrypted()` → encrypted PDFs yield an explicit error (pointing to `qpdf --decrypt`). - The per-page `extract_text(&[page])` call is wrapped in `catch_unwind`, turning a malformed-page panic into a recoverable warning. - Best-effort extraction of Title/Producer/Creator from the `/Info` dict. UTF-16BE BOM-prefixed strings are also decoded (non-ASCII Titles such as Korean are handled correctly). - 9 integration tests: 3-page emit, scanned-mixed warning, encrypted refuse, corrupt header error, page_count metadata, UTF-16BE Title, filename fallback, determinism, snapshot. `parser_version = "pdf-text-v1"`. Allowed deps: `lopdf 0.32` + `pdf-extract 0.7` (as in the original spec). The multilingual body OCR fallback is a follow-up task in §9.2 (out of scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 08:34:55 +00:00
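
Commit 5a158d7343 describes wrapping per-page extraction in `catch_unwind` so that a malformed page becomes a warning instead of aborting the document. A minimal sketch of that pattern follows. It is an illustration only: `PageResult`, `extract_page_guarded`, and the closure-based extractor are hypothetical stand-ins, not the crate's actual types, and the real code emits `Block::Paragraph` + `Provenance::Warning` rather than this simplified struct.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Simplified, hypothetical per-page outcome (not the crate's real type).
#[derive(Debug, PartialEq)]
pub struct PageResult {
    pub page: u32,
    pub text: String,
    pub warning: Option<String>,
}

/// Runs `extract` for one page. A panic or an empty (whitespace-only) result
/// is turned into an empty text plus a "scanned candidate" warning, so the
/// caller can keep processing the remaining pages.
/// `extract` stands in for the real per-page text extractor (assumption).
pub fn extract_page_guarded<F>(page: u32, extract: F) -> PageResult
where
    F: FnOnce(u32) -> String,
{
    // AssertUnwindSafe: nothing captured by the closure is reused after a panic.
    let outcome = panic::catch_unwind(AssertUnwindSafe(|| extract(page)));
    match outcome {
        Ok(text) if !text.trim().is_empty() => PageResult {
            page,
            text,
            warning: None,
        },
        // Empty body or panic: mark the page as a candidate for OCR.
        _ => PageResult {
            page,
            text: String::new(),
            warning: Some("scanned candidate".to_string()),
        },
    }
}
```

Note that `catch_unwind` only helps when the binary is built with the default `panic = "unwind"` strategy; with `panic = "abort"` the process still terminates.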

7 Commits
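
Commit c2cd3a7ab7 introduces `text_quality::compute_valid_char_ratio(s) -> f32`, and the log states which character classes count as valid. The sketch below shows one way such a ratio could be computed. The function name and the "empty string → 0.0" rule come from the commit message; the exact Unicode ranges, the `is_valid_char` helper, and the decision to treat control characters such as newlines as invalid are assumptions made for illustration and may differ from the real module.

```rust
/// Hypothetical classifier: true for characters this sketch counts as valid.
/// The ranges are assumptions approximating the classes named in the commit.
fn is_valid_char(c: char) -> bool {
    matches!(
        c,
        ' '..='~'                     // ASCII printable
            | '\u{00A0}'..='\u{024F}' // Latin-1 Supplement + Latin Extended-A/B
            | '\u{1100}'..='\u{11FF}' // Hangul Jamo
            | '\u{3130}'..='\u{318F}' // Hangul Compatibility Jamo
            | '\u{AC00}'..='\u{D7A3}' // Hangul Syllables
            | '\u{4E00}'..='\u{9FFF}' // CJK Unified Ideographs
            | '\u{3000}'..='\u{303F}' // CJK Symbols and Punctuation (assumed)
    )
}

/// Fraction of characters in `s` that are valid; an empty string yields 0.0.
pub fn compute_valid_char_ratio(s: &str) -> f32 {
    let total = s.chars().count();
    if total == 0 {
        return 0.0;
    }
    let valid = s.chars().filter(|c| is_valid_char(*c)).count();
    valid as f32 / total as f32
}
```

Per the log, the caller compares this ratio against a threshold (default 0.5), and a page whose ratio falls below it, such as the fixed F4 fixture at roughly 0.375, still meets the OCR trigger condition.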