fix(ocr): address PR #206 round-1 review: golden CI tests + PDF tuning docs + threshold const + mutex recovery

- [MEDIUM] Add two golden CI unit tests: ctc_greedy_decode_golden (checks
  the decoded string for a one-hot argmax_idx input) and det_box_score_golden
  (checks box_score/unclip_rect against golden corners). Neither needs a
  model or ONNX, so both stay in CI permanently.
  Extract ctc_greedy_decode into a free function
  (ctc_greedy_decode_with_dict) to make it testable.
- [MEDIUM] Document the PDF paddle tuning asymmetry: build_pdf_ocr_engine now
  states why paddle-onnx reads image.ocr.* (not pdf.ocr.*), and the
  PdfOcrCfg.engine field doc is updated.
- [MEDIUM] Extract the DBNet binarization magic number 0.3 into the
  DET_BIN_THRESH const, and add a one-line comment on why the score_thresh
  default is loose.
- [LOW] Mutex poison recovery: det/rec .expect("poisoned") →
  .unwrap_or_else(PoisonError::into_inner), so that a panic on one asset
  does not abort ingest.
- [LOW] Remove the dead DetBox.score field (the box_score result is only used
  for filtering and does not need to be stored).
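
The mutex poison recovery item can be illustrated with a minimal sketch. `Session`, `run_inference`, and `poison` are hypothetical names used only for illustration and are not taken from this repository; only `.unwrap_or_else(PoisonError::into_inner)` comes from the change itself.

```rust
use std::sync::{Arc, Mutex, PoisonError};
use std::thread;

/// Hypothetical stand-in for a det/rec session guarded by a Mutex.
pub struct Session {
    pub calls: u32,
}

/// Lock the session and take the guard even if an earlier holder panicked.
/// With `.expect("poisoned")` this call would panic on a poisoned mutex.
pub fn run_inference(session: &Mutex<Session>) -> u32 {
    let mut guard = session.lock().unwrap_or_else(PoisonError::into_inner);
    guard.calls += 1;
    guard.calls
}

/// Poison the mutex by panicking on another thread while the lock is held
/// (simulates a panic while processing a single asset).
pub fn poison(session: &Arc<Mutex<Session>>) {
    let s = Arc::clone(session);
    let _ = thread::spawn(move || {
        let _guard = s.lock().unwrap();
        panic!("simulated panic while processing one asset");
    })
    .join();
}
```

After `poison` runs, `Mutex::is_poisoned` returns true, yet `run_inference` still succeeds. The trade-off is that the guarded state may be left half-updated by the panicking holder, so this recovery is only reasonable when that state stays usable after a panic.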

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 09:13:27 +00:00
parent 3d5bb599e3
commit f3a7222ec5
3 changed files with 145 additions and 49 deletions

@@ -561,7 +561,9 @@ pub struct PdfOcrCfg {
/// scanned pages only. `true`: the vision LLM is called on every page
/// (dual-text confidence boost for vector PDFs; doubles chunk count).
pub always_on: bool,
-/// Engine identifier. v1 only ships `"ollama-vision"`.
+/// Engine identifier: `"ollama-vision"` or `"paddle-onnx"`. When set to
+/// `"paddle-onnx"`, model paths and tuning knobs are read from
+/// `[image.ocr]`, not `[pdf.ocr]` — PaddleOCR has no PDF-specific tuning.
pub engine: String,
/// Vision model id. Default `"qwen2.5vl:3b"` per PoC (§3.5 family
/// asymmetry vs image OCR's gemma4:e4b is acknowledged).