f3a7222ec5
fix(ocr): PR #206 round-1 리뷰 반영 — 골든 CI 테스트 + PDF 튜닝 문서 + threshold const + mutex 복구
...
- [MEDIUM] 골든 CI 단위테스트 2건 추가: ctc_greedy_decode_golden (argmax_idx
one-hot → decoded 문자열 검증), det_box_score_golden (box_score/unclip_rect
golden corner 검증). 모델/ONNX 불요, CI 상주.
ctc_greedy_decode를 자유 함수(ctc_greedy_decode_with_dict)로 추출하여 테스트
가능하게 함.
- [MEDIUM] PDF paddle 튜닝 비대칭 문서화: build_pdf_ocr_engine에 paddle-onnx가
image.ocr.* 사용(pdf.ocr.* 아님) 이유 명시 + PdfOcrCfg.engine 필드 doc 갱신.
- [MEDIUM] DBNet 이진화 매직넘버 0.3 → DET_BIN_THRESH const 추출 + score_thresh
기본값 느슨한 이유 1줄 주석.
- [LOW] Mutex poison 복구: det/rec .expect("poisoned") →
.unwrap_or_else(PoisonError::into_inner). 자산 panic이 ingest abort 안 되도록.
- [LOW] DetBox.score dead field 제거 (box_score 결과는 필터에만 사용, 저장 불요).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-06-04 09:13:27 +00:00
3d5bb599e3
feat(ocr): bundle PP-OCRv5 ONNX models (det 4.7MB + rec 13MB)
...
paddle-onnx engine assets — committed as plain binary blobs (git-lfs not
installed on this host; see .gitattributes for the LFS migration recipe).
NOTICE (Apache-2.0) + korean_dict.txt already tracked. Loaded by default from
this dir or KEBAB_IMAGE_OCR_MODEL_DIR.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com >
2026-06-04 08:36:34 +00:00
375a0693e4
chore(ocr): T11/T12 — clippy clean + docs + v0.27.0 bump
...
T11: fix 12 clippy lints in paddle_onnx.rs/paddle_e2e.rs (doc overindent,
finish_non_exhaustive, map_or_else, RangeInclusive::contains, cast_lossless,
is_some_and, usize::from). Full-workspace clippy -D warnings = 0.
Smoke (paddle-onnx, real binary): clean_paragraph OCR verbatim-correct, real
per-region confidence (0.99/0.96/0.95), FTS5 lexical hit on Korean(검색)+
English(embedding), parser_version folds |ocr:1:paddle-onnx:<ver>. Big page
<4s inference (5.6s ingest incl. one-time session load).
T12: README [image.ocr].engine + ARCHITECTURE OCR row + SMOKE paddle-onnx config
+ HANDOFF + HOTFIXES dated entry. Workspace version 0.26.2 → 0.27.0 (minor:
new engine value + config keys). .gitattributes: onnx as plain blobs (no git-lfs).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com >
2026-06-04 08:36:19 +00:00
8cc4e6d563
fix(ocr): T10/T11 — unclip edge-offset (CER 0.26→0.005) + e2e gate + error tests
...
Root cause found at T11 e2e: unclip_rect pushed corners radially from the
centroid. For a wide/short text box the diagonal is near-horizontal, so the box
barely grew in height and clipped character tops (ㄷ→ㄴ, 다→나). Rewrote unclip
as a proper per-edge polygon offset along the rect's own (u,v) axes — height and
width each grow by 2*distance, matching PaddleOCR pyclipper.
Result (synthetic-ocr-bench, real inference): mean gate CER 0.2585 → 0.0049
(clean_paragraph/korean_heavy/numbers_table/tech_terms = 0.0), beating the
0.976 PoC baseline. Big page 3.9s < 5s.
T10: dict-length-mismatch construction error + undecodable-bytes recognize error.
T11 e2e: tests/paddle_e2e.rs CER<=0.05 gate (skips cleanly when assets absent).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com >
2026-06-04 08:22:47 +00:00
901416d8e9
feat(ocr): T7-T9 — config overrides + engine factory + signature cascade
...
T7: OcrCfg gains det_model/rec_model/dict overrides + score_thresh/
unclip_ratio/max_boxes (serde default, KEBAB_IMAGE_OCR_* env). OnnxPaddleOcr::new
threads them via ModelPaths::from_config.
T8: build_image_ocr_engine / build_pdf_ocr_engine factories return
Box<dyn OcrEngine>; match on engine string (ollama-vision|paddle-onnx|err).
ImagePipeline.ocr_engine + pdf_ocr_engine signatures switched to &dyn OcrEngine.
OcrEngine gains model() for the progress label.
T9: ingest_config_signature image/pdf branches emit |ocr:1:{engine}:{engine_version}
(memoized blake3 per asset-triple, m3-safe). Unit tests (a)(b)(c) added.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com >
2026-06-04 08:15:30 +00:00
b706e3e88c
feat(ocr): T2-T6 OnnxPaddleOcr core engine — det/rec ONNX + DBNet postproc + CTC
...
PP-OCRv5 ONNX OCR engine on the pinned ort rc.9 (no Python, no oar-ocr dep).
Implements the recognize() pipeline end-to-end (compiles + unit-tested):
- T2: OnnxPaddleOcr skeleton, OcrEngine impl, det/rec Session loaded once
(Mutex-wrapped → Send+Sync), engine_version = blake3(det+rec+dict) cached
once at construction, dict bounds-check (11945 lines vs 11947 rec classes).
- T2 preproc: det ImageNet mean/std NCHW + limit_side_len 960 → ×32 round
(golden 192x900→896x192 pinned); rec height-48 keep-aspect, (x-0.5)/0.5.
- T3 det postproc: threshold 0.3 → imageproc contours → min-area rect via
pure-Rust rotating calipers + convex hull → mean-prob box-score filter →
pure-Rust unclip(ratio 1.5). No clipper2/OpenCV.
- T4 crop+rectify: corner ordering + bilinear perspective warp to horizontal.
- T5 rec+CTC: greedy decode with the T0a-confirmed mapping
(idx0=blank, 1..=11945=dict[idx-1], 11946=space), rec-class bounds-check.
- T6 assembly: reading-order OcrText with per-region bbox + real confidence.
Unit tests (4 pass): det_target_dims golden, convex hull, min-area rect,
unclip expansion. Large *.onnx assets stay untracked pending T12 LFS decision.
Remaining: T7 config overrides, T8 factory (4 sites), T9 signature cascade,
T10 error matrix, T11 gates (clippy/e2e CER), T12 docs+bump+PR.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com >
2026-06-04 07:52:39 +00:00
8f8d3a4100
feat(ocr): T0a/T0/T1 — golden harness(CTC blank=0 도출) + deps(ort rc.9) + dict/NOTICE
...
T0a: onnxruntime 직접 골든 하네스 → CTC blank/dict 매핑 경험 확정(gt CER 0.000).
T0: 모델 번들 dict+NOTICE(.onnx 는 T12 LFS 결정까지 워크트리 보관).
T1: ort(download-binaries)+imageproc 추가, cargo tree ort rc.9 단일 확인.
2026-06-04 07:43:53 +00:00
75a543ff69
docs(ocr): PP-OCRv5 ONNX Rust 네이티브 OCR spec + plan
2026-06-04 07:31:03 +00:00