Commit Graph

10 Commits

Author SHA1 Message Date
685007789a style: cargo fmt --all (round 4 ingest log feature follow-up)
Phase C4 executor 의 마지막 `fix(test): clippy + fmt fixes` commit 이
test file 부분만 fmt 적용. workspace 전체 fmt 누락 발견 → cargo fmt --all
적용. 모든 import alphabetical reorder + line wrapping 정합.

추가 untracked artifact 동시 commit:
- docs/superpowers/specs/2026-05-28-v0.20-ingest-log-spec.md (491 line, ACCEPT)
- docs/superpowers/plans/2026-05-28-v0.20-ingest-log-plan.md (616 line, ACCEPT)

workspace test: 1370 passed / 0 failed / 50 ignored, ingest_log_smoke green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:18:40 +00:00
a58ee10dfb fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6)
Why: metro-korea.pdf (Identity-H CID font without ToUnicode CMap) 의
ingest 가 pdf_ocr_pages=0 으로 잘못 종료. lopdf 0.32.0 의 emit
`?Identity-H Unimplemented?` marker 28 ASCII char 가 is_valid_text_char()
의 0x0020..=0x007E range 통과 → ratio=1.0 → OCR fallback 0.5
threshold bypass.

Change: MOJIBAKE_MARKERS const + compute_valid_char_ratio() 4-단계
(strip → trim-empty zero → dominance cap-0.3 → 기존 ratio). marker
list extensible. is_valid_text_char() 본체 변경 0.

Tests: +2 unit (dominance + minority) on top of 기존 8. parser_version
/ wire schema 변경 0.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.1 / §4.2 / §6 R-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:42:59 +00:00
e674ff474b fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)
v0.20.0 sub-item 1 dogfood report 의 Bug #4 — F4 mojibake.pdf 의 lopdf
`get_pages()` count = 0 (Pages tree broken). root cause = 기존 byte-
level `re.sub` + manual startxref edit 가 lopdf strict load 통과시키지만
Pages dict 의 `/Kids` reference 깨짐.

- `tests/fixtures/_synth/mojibake.py`: full rewrite — replace byte-level
  `re.sub` + manual startxref with pikepdf open+inject-dummy-ToUnicode+
  del+save (auto xref regen). HYSMyeongJo-Medium CID font: CID font 이
  ToUnicode 를 자체 생성하지 않아 dummy stream 을 inject 후 strip
  (removed=1 invariant). Exit codes 2/3/4 for invariant fail.
- `crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf`: regenerate via
  pikepdf — 1 valid page, no /ToUnicode marker, byte-identical 후 reproducible.
- `crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json`:
  regen via 2-run cargo test pattern (hand-rolled unwrap_or_else baseline
  bootstrap, no insta crate).
- `crates/kebab-parse-pdf/tests/text_extractor_regression.rs`: append 3
  invariant test — (1) lopdf 1-page, (2) /ToUnicode marker absent,
  (3) PdfTextExtractor 1-block invariant.
- `crates/kebab-parse-pdf/src/text_quality.rs`: f4_fixture_ratio_under_threshold
  threshold 0.3 → 0.5 (production valid_ratio_threshold 기본값). 구 broken
  fixture (pages=0) 는 extract_text="" → ratio=0.0; 신 fixed fixture 는
  CID 2-byte fallback decode → ratio≈0.375 — 여전히 OCR trigger 조건 충족.

spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 4)
prior: 241ded5 (Step 3 integration test)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:02:17 +00:00
8d81bc1071 style(parse-pdf): satisfy clippy pedantic in page_image (uninlined_format_args + map_unwrap_or)
c2cd3a7 의 `extract_dctdecode_page_image` 에 workspace clippy pedantic 위반 2 건
잔존 → CI gate (cargo clippy --workspace --all-targets -- -D warnings) fail.
두 lint 모두 1-line edit + semantic 동일, logic 변경 0.

- L20  uninlined_format_args: format!("page {} not in get_pages()", page_num)
                              → format!("page {page_num} not in get_pages()")
- L48-52 map_unwrap_or:        .map(|n| n == b"Image").unwrap_or(false)
                              → .is_some_and(|n| n == b"Image")

cargo clippy --workspace --all-targets -j 4 -- -D warnings → exit 0.
cargo test -p kebab-parse-pdf -j 4 → 21 passed (regression 0).

review trail:
- spec review: .omc/reviews/2026-05-27-pdf-ocr-step-03-spec-review-result.md (SPEC_COMPLIANT)
- code review: .omc/reviews/2026-05-27-pdf-ocr-step-03-code-review-result.md (Critical C-1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 06:14:00 +00:00
c2cd3a7ab7 feat(parse-pdf): add page_image (DCTDecode passthrough, 2 test) + text_quality (valid char ratio, 8 unit test) modules
Step 3 (Group C) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

C1 — `page_image::extract_dctdecode_page_image(pdf_doc, page_num)` ->
Result<Option<Vec<u8>>>. lopdf 의 Resources/XObject traverse, 첫 image
XObject 의 /Filter 검사 (single Name OR Array form 모두 cover, spec §4.1
line 642-664), DCTDecode + JPEG magic 검증 통과 시 raw bytes 반환. 다른
encoding 또는 image XObject 부재 시 Ok(None). v1 scope = DCTDecode
passthrough only (H-3 invariant, image crate 도입 0).

Integration test (`tests/page_image.rs`, 2 test):
- f1_fixture_yields_dctdecode_jpeg_bytes — F1 fixture happy path.
- flate_raw_fixture_yields_none — F6 fixture negative path.

C2 — `text_quality::compute_valid_char_ratio(s) -> f32`. valid char =
ASCII printable + Hangul (Jamo/Compatibility/Syllables) + CJK + Latin
Extended + common Korean punctuation. 빈 string → 0.0. caller
(`kebab-app::pdf_ocr_apply`) 가 threshold 와 비교 (default 0.5).

Unit test (`mod tests`, 7 + F4 conditional):
- empty / pure ASCII / pure Hangul / pure PUA / mixed half / CJK / Hangul Jamo.
- f4_fixture_ratio_under_threshold: active (case A — lopdf extract_text 가
  ToUnicode CMap 부재 시 빈 string 반환 → valid_ratio = 0.0000 < 0.3).

Also: Cargo.toml description 갱신 ("Text PDF extractor + scanned-page
image extract helpers ...", Step 1 A2 이연분).

fixture fix: mojibake.pdf 의 startxref 22130 → 22114 (16-byte offset 오차
수정 — lopdf strict parser 가 xref 를 찾지 못하는 버그 해결).

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§4.1 line 600-722)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 3 C1+C2)
prior: aeeff36 (Step 2 fixtures) + fb3952d (Step 2 F7 record fix)
contract: §9 (additive minor wire bump — 후속 step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 05:59:10 +00:00
7c85de065a chore: workspace-wide cleanup — clippy::pedantic baseline + auto-fix
cut PR v0.18.0 전 마지막 정리. 사용자 요청: "전체 코드베이스를 깔끔하고 알아보기 쉽게".

## Workspace lints

- `Cargo.toml` 의 `[workspace.lints.clippy]` 에 `pedantic = "warn"` (priority -1) + 의도적 allow-list 추가:
  - cast_possible_truncation / cast_possible_wrap / cast_sign_loss / cast_precision_loss — ONNX i64 / hash modular reduction 등 의도적 truncation.
  - doc_markdown / missing_errors_doc / missing_panics_doc — cosmetic doc style.
  - too_many_lines / module_name_repetitions / must_use_candidate / needless_pass_by_value / manual_let_else / items_after_statements / similar_names — informational only.
  - format_collect / match_wildcard_for_single_variants / trivially_copy_pass_by_ref / unnecessary_wraps — intentional patterns (exhaustive match, future Result variants 등).
  - default_trait_access — `Foo::default()` 가 idiomatic.
  - float_cmp — NLI / RRF score 의 explicit threshold 비교 의도.
  - struct_excessive_bools / case_sensitive_file_extension_comparisons / naive_bytecount / ignore_without_reason — domain-specific 의도.
  - format_push_string / return_self_not_must_use / match_same_arms — builder / wire-label / hot-path 패턴 보존.
  - needless_continue / used_underscore_binding / nonminimal_bool / unreadable_literal / many_single_char_names / doc_link_with_quotes / assigning_clones / collapsible_str_replace / trivial_regex / elidable_lifetime_names / range_plus_one / explicit_iter_loop / implicit_hasher / ref_option — remaining low-value style.
- 각 24 crate `Cargo.toml` 에 `[lints] workspace = true` 추가.

## Auto-fix

`cargo clippy --workspace --all-targets --fix` 적용 — 128 files changed, 552 insertions / 472 deletions. 주로:
- uninlined_format_args (~18): `format!("{}", x)` → `format!("{x}")`.
- redundant_closure_for_method_calls (~33): `.map(|x| x.foo())` → `.map(T::foo)`.
- 그 외 mechanical refactor.

## 검증

- `cargo clippy --workspace --all-targets -j 1 -- -D warnings` clean (pedantic + 모든 lint group).
- `cargo test --workspace --no-fail-fast -j 1` — **1293 tests pass + 1 pre-existing flaky fail** (`kebab-mcp::tools_call_ask_multi_hop::ask_tool_routes_multi_hop_true_to_decompose_first`, HOTFIX candidate, cleanup 무관). 회귀 0.

Wire 영향: 없음.
Behavior 영향: 없음 (mechanical refactor only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 03:01:58 +00:00
th-kim0823
bf4ebf8d2a feat(p10-1a-1): add Metadata.repo / git_branch / git_commit / code_lang
Four optional, serde-skipped-when-None fields added to `Metadata` for
code ingest context. All 11 downstream construction sites patched with
`repo: None, git_branch: None, git_commit: None, code_lang: None`.
Full workspace check (`--tests`) and per-crate test suite pass clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 15:44:18 +09:00
f867b36afb feat(kebab-core): p9-fb-23 task 2 — CanonicalDocument gains last_chunker_version + last_embedding_version
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:50:25 +00:00
8de08cf38c review(p7-1): 회차 1 지적 반영
- Cargo.toml: 사용하지 않는 deps 제거 (`kebab-config`, `thiserror`,
  `pdf-extract`, dev `tempfile` / `serde_json` / `serde`). 특히
  `pdf-extract` 가 끌어오던 transitive ~150 crate (pom, postscript,
  type1-encoding-parser, adobe-cmap-parser, euclid, chrono, md5,
  linked-hash-map …) 가 모두 사라짐. lopdf 만 남음.
- info.rs: BOM 없는 PDFDocEncoded Title 디코드 버그 수정. `from_utf8_lossy`
  는 0x80–0xFF 를 U+FFFD 로 치환해 "Café" 같은 레거시 타이틀을 망가뜨림.
  byte → `char` 직접 캐스팅 (Latin-1 디코더) 로 교체. 회귀 테스트
  `info_dict_title_pdfdocencoding_latin1_high_bytes_decoded` 추가.
- info.rs: 모듈 doc 의 "Latin-1 superset" 부정확 표현 정정 — PDFDocEncoding
  은 0x18–0x1F / 0x80–0x9F 영역에서 Latin-1 과 다름.
- lib.rs: `saturating_sub(1)` 가 page=0 케이스를 silent 흡수하던 부분에
  `debug_assert!` 추가. release 는 saturating fallback 유지 (panic 보다
  garbled order 가 운영에 유리).
- tests: UTF-16 surrogate pair 커버리지 갭 보완 — 🥙 (U+1F959) 가 포함된
  타이틀로 `String::from_utf16_lossy` 의 페어-결합 경로 검증.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 08:40:40 +00:00
5a158d7343 feat(kebab-parse-pdf): P7-1 text PDF extractor — per-page CanonicalDocument
`PdfTextExtractor`(MediaType::Pdf) lopdf 기반 per-page 텍스트 추출.
페이지마다 `Block::Paragraph` + `SourceSpan::Page { page, char_start, char_end }`
emit. 본문이 비거나 추출 panic 인 페이지는 빈 paragraph + `Provenance::Warning`
("scanned candidate") 로 표시 — 이후 OCR fallback (별도 task) 의 입력.

핵심 동작:
- `lopdf::Document::load_mem` + `is_encrypted()` → 암호화 PDF 는 명시 에러
  (`qpdf --decrypt` 안내).
- 페이지 단위 `extract_text(&[page])` 를 `catch_unwind` 로 감싸 malformed
  page panic 을 recoverable warning 으로 변환.
- `/Info` dict 에서 Title/Producer/Creator best-effort 추출. UTF-16BE BOM
  prefixed 문자열도 디코드 (한국어 등 non-ASCII Title 정상 처리).
- 9개 통합 테스트: 3-page emit, scanned-mixed warning, encrypted refuse,
  corrupt header error, page_count 메타, UTF-16BE Title, filename
  fallback, determinism, snapshot.

`parser_version = "pdf-text-v1"`. Allowed deps: `lopdf 0.32` + `pdf-extract 0.7`
(원본 spec 그대로). 본문 다국어 OCR fallback 은 §9.2 후속 task (out of scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 08:34:55 +00:00