Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
B1 — lopdf /Filter probe (Python re + shell grep on synthesized fixtures,
result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md).
Key findings:
- reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ];
useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0.
- Pillow's Image.save(path, 'PDF') on an RGB image emits DCTDecode, so the
F6 FlateDecode fixture requires manual PDF construction via zlib.compress.
- ghostscript pdfwrite rejects TIFF input (/undefined in II*) —
ImageMagick `convert -compress Group4` used for F7 CCITTFax.
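The B1 probe can be approximated with a short Python sketch. This is not the script that produced the findings above; `probe_filters` and `has_jfif_magic` are hypothetical names, and the regex only sees /Filter entries stored in plain (uncompressed) object dictionaries.

```python
import re

# Matches "/Filter /Name" or "/Filter [ /A /B ]" in raw PDF bytes.
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
NAME_RE = re.compile(rb"/([A-Za-z0-9]+)")


def probe_filters(pdf_bytes: bytes) -> list[list[str]]:
    """Return the filter chain of every /Filter entry, in file order."""
    chains = []
    for match in FILTER_RE.finditer(pdf_bytes):
        names = NAME_RE.findall(match.group(1))
        chains.append([name.decode("ascii") for name in names])
    return chains


def has_jfif_magic(stream: bytes) -> bool:
    """True if a decoded stream starts with the JPEG/JFIF marker ffd8ffe0."""
    return stream.startswith(b"\xff\xd8\xff\xe0")
```

A chain equal to `["DCTDecode"]` corresponds to the pure JPEG passthrough case (useA85=0); `["ASCII85Decode", "DCTDecode"]` corresponds to the reportlab default.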
B2: synthesize and commit 5 fixtures under crates/kebab-parse-pdf/tests/fixtures/:
- F1 scanned_page1.pdf: /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, Korean text).
- F2 scanned_page2.pdf: /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, Korean batchim, i.e. final consonants).
- F4 mojibake.pdf: DejaVu TTF with the ToUnicode CMap stripped (count=0);
DejaVu is used because the Noto CJK TTC has PostScript outlines, which reportlab does not support.
- F6 flate_raw.pdf: /Filter /FlateDecode, DCTDecode absent (skip path input).
- F7 ccitt.pdf: /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input).
Synth scripts under tests/fixtures/_synth/:
- scanned_pdf.py — F1/F2 reportlab drawImage + JPEG passthrough (useA85=0).
- mojibake.py — F4 reportlab DejaVu TTF + ToUnicode strip.
- flate_ccittfax.sh — F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert.
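For reference, the "manual zlib PDF" idea behind F6 can be sketched in Python as below. This is an illustrative sketch only, not the contents of flate_ccittfax.sh; `build_flate_pdf` is a hypothetical helper, and an 8-bit DeviceGray image is assumed for brevity.

```python
import zlib


def build_flate_pdf(width: int, height: int, gray: bytes) -> bytes:
    """Build a one-page PDF whose only image is stored as /FlateDecode."""
    assert len(gray) == width * height, "expected one byte per pixel"
    img = zlib.compress(gray)
    # Content stream: scale the unit square to the image size and paint it.
    content = f"q {width} 0 0 {height} 0 0 cm /Im0 Do Q".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width} {height}] "
         f"/Resources << /XObject << /Im0 4 0 R >> >> "
         f"/Contents 5 0 R >>").encode("ascii"),
        (f"<< /Type /XObject /Subtype /Image /Width {width} "
         f"/Height {height} /ColorSpace /DeviceGray /BitsPerComponent 8 "
         f"/Filter /FlateDecode /Length {len(img)} >>\nstream\n"
         ).encode("ascii") + img + b"\nendstream",
        f"<< /Length {len(content)} >>\nstream\n".encode("ascii")
        + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += f"{num} 0 obj\n".encode("ascii") + body + b"\nendobj\n"
    xref_pos = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode("ascii")
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for off in offsets:
        out += f"{off:010d} 00000 n \n".encode("ascii")
    out += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
            f"startxref\n{xref_pos}\n%%EOF\n").encode("ascii")
    return bytes(out)
```

Writing the result of `build_flate_pdf(2, 2, bytes([0, 255, 255, 0]))` to disk gives a tiny file with a FlateDecode image and no DCTDecode stream, which is the property the skip path relies on.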
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2)
contract: §9 (additive minor wire bump, deferred to a follow-up step)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>