fix(ocr): address PR #206 round-1 review: golden CI tests + PDF tuning docs + threshold const + mutex recovery
- [MEDIUM] Add two golden CI unit tests: ctc_greedy_decode_golden (checks the
string decoded from one-hot argmax_idx input) and det_box_score_golden (checks
box_score/unclip_rect against golden corners). Neither needs a model or ONNX,
so both stay in CI permanently.
Extract ctc_greedy_decode into a free function (ctc_greedy_decode_with_dict)
to make it testable.
- [MEDIUM] Document the PDF paddle tuning asymmetry: build_pdf_ocr_engine now
states why paddle-onnx uses image.ocr.* (not pdf.ocr.*), and the
PdfOcrCfg.engine field doc is updated.
- [MEDIUM] Extract the DBNet binarization magic number 0.3 into the
DET_BIN_THRESH const, and add a one-line comment on why the score_thresh
default is loose.
- [LOW] Mutex poison recovery: det/rec .expect("poisoned") →
.unwrap_or_else(PoisonError::into_inner), so that a panic on one asset does
not abort ingest.
- [LOW] Remove the dead DetBox.score field (the box_score result is used only for filtering and does not need to be stored).
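
The golden test in the first item exercises greedy CTC decoding on hand-built argmax indices. The following is a minimal sketch of that idea, not the function from this commit: the signature of `ctc_greedy_decode_with_dict` is not shown here, so the one below is hypothetical, and it assumes class 0 is the CTC blank and class `i` (for `i >= 1`) maps to `dict[i - 1]`.

```rust
/// Hypothetical signature; the commit does not show the real one.
/// Assumption: class 0 is the CTC blank, class i (i >= 1) maps to dict[i - 1].
fn ctc_greedy_decode_with_dict(argmax_idx: &[usize], dict: &[&str]) -> String {
    let mut out = String::new();
    let mut prev = 0usize; // behave as if the step before the first was blank
    for &idx in argmax_idx {
        // Emit a token only when the class is non-blank and differs from
        // the previous step (collapses repeats, drops blanks).
        if idx != 0 && idx != prev {
            if let Some(tok) = dict.get(idx - 1) {
                out.push_str(tok);
            }
        }
        prev = idx;
    }
    out
}

#[cfg(test)]
mod tests {
    use super::*;

    // Golden test: needs no model and no ONNX runtime.
    #[test]
    fn ctc_greedy_decode_golden() {
        let dict = ["a", "b", "c"];
        // a a blank a b b blank c  ->  "aabc"
        let decoded = ctc_greedy_decode_with_dict(&[1, 1, 0, 1, 2, 2, 0, 3], &dict);
        assert_eq!(decoded, "aabc");
    }
}
```

Because the decoder takes plain indices and a dictionary slice, the test can pin the expected string without loading any recognition model, which is what lets it run in CI.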
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
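
The mutex item above replaces a panicking `.expect("poisoned")` with poison recovery. A minimal standalone sketch of that pattern follows, using only the standard library; the helper names and the `u32` payload are hypothetical stand-ins for the det/rec sessions, and recovery is only safe under the assumption that the guarded state remains usable after a panic.

```rust
use std::sync::{Arc, Mutex, MutexGuard, PoisonError};
use std::thread;

/// Locks the mutex and returns the guard even if a previous holder panicked.
fn lock_recovering<T>(m: &Mutex<T>) -> MutexGuard<'_, T> {
    m.lock().unwrap_or_else(PoisonError::into_inner)
}

/// Poisons a mutex from a panicking thread, then keeps using it.
/// The u32 counter is a hypothetical stand-in for an OCR session.
fn poisoned_then_recovered() -> u32 {
    let session = Arc::new(Mutex::new(0u32));
    let worker = Arc::clone(&session);
    // Simulate one asset panicking while it holds the lock.
    let _ = thread::spawn(move || {
        let mut guard = worker.lock().unwrap();
        *guard += 1;
        panic!("simulated per-asset failure");
    })
    .join();
    assert!(session.is_poisoned());
    // `.expect("poisoned")` would panic here; recovery hands back the guard.
    let mut guard = lock_recovering(&session);
    *guard += 1;
    *guard
}
```

With `.expect`, every later caller would panic once the lock is poisoned, so a single failing asset would take down the rest of the ingest run; `PoisonError::into_inner` trades that for continuing with whatever state the panicking holder left behind.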
@@ -855,6 +855,17 @@ fn build_image_ocr_engine(
 /// endpoint fallback to `models.llm.endpoint`). The paddle-onnx arm shares
 /// the same bundled ONNX models as image OCR (resolved from `image.ocr`
 /// overrides) — PaddleOCR is page-agnostic and carries no per-engine prompt.
+///
+/// # Paddle-ONNX asymmetry
+///
+/// When `pdf.ocr.engine = "paddle-onnx"`, the model paths and tuning knobs
+/// (`det_model`, `rec_model`, `dict`, `score_thresh`, `unclip_ratio`,
+/// `max_boxes`, `max_pixels`) are read from **`[image.ocr]`**, not
+/// `[pdf.ocr]`. PaddleOCR has no PDF-specific prompt or page-level config;
+/// `[pdf.ocr]` fields other than `engine` / `enabled` / `always_on` /
+/// `valid_ratio_threshold` / `min_char_count` / `lang_hint` are effectively
+/// ignored for the paddle path. This asymmetry is intentional — one set of
+/// tuned ONNX knobs serves both image and PDF pages.
 fn build_pdf_ocr_engine(
     config: &kebab_config::Config,
 ) -> anyhow::Result<Box<dyn OcrEngine>> {
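
The doc comment in this hunk describes where the paddle path reads its tuning knobs from. A rough sketch of that lookup is shown below; all types here are hypothetical, trimmed-down stand-ins (the real `kebab_config::Config` layout is not shown in the diff), and gating on `enabled` is an assumption based on the list of `[pdf.ocr]` fields the comment says are still honored.

```rust
/// Hypothetical subset of the paddle tuning knobs named in the doc comment.
#[derive(Debug, Clone, PartialEq)]
struct PaddleKnobs {
    score_thresh: f32,
    unclip_ratio: f32,
    max_boxes: usize,
}

/// Hypothetical stand-in for the `[image.ocr]` section.
struct ImageOcrCfg {
    paddle: PaddleKnobs,
}

/// Hypothetical stand-in for the `[pdf.ocr]` section.
struct PdfOcrCfg {
    engine: String,
    enabled: bool,
}

struct Config {
    image_ocr: ImageOcrCfg,
    pdf_ocr: PdfOcrCfg,
}

/// Returns the knobs the PDF path would use, or `None` when PDF OCR is
/// disabled or another engine is selected.
fn pdf_paddle_knobs(cfg: &Config) -> Option<&PaddleKnobs> {
    if cfg.pdf_ocr.enabled && cfg.pdf_ocr.engine == "paddle-onnx" {
        // Deliberately read from the image section, not the PDF section.
        Some(&cfg.image_ocr.paddle)
    } else {
        None
    }
}
```

The point of the sketch is that `[pdf.ocr]` only selects and gates the engine, while every tuning value comes from `[image.ocr]`, so tuning the image pipeline also changes PDF OCR output.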