Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
B1 — lopdf /Filter probe (Python re + shell grep on synthesized fixtures,
result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md).
Key findings:
- reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ];
useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0.
- Pillow RGB.save('.pdf','PDF') uses DCTDecode — F6 FlateDecode requires
manual PDF construction via zlib.compress.
- ghostscript pdfwrite rejects TIFF input (/undefined in II*) —
ImageMagick `convert -compress Group4` used for F7 CCITTFax.
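
A minimal sketch of the kind of regex probe B1 describes, assuming the /Filter entries sit in uncompressed object dictionaries (entries inside object streams would not be seen). `probe_filters` and `FILTER_RE` are hypothetical names for illustration, not the probe actually recorded in the PoC doc:

```python
import re

# Matches "/Filter /Name" as well as "/Filter [ /Name1 /Name2 ]" in raw PDF bytes.
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")


def probe_filters(pdf_bytes: bytes) -> list[list[str]]:
    """Return one filter chain (list of names) per /Filter entry, in file order."""
    chains = []
    for match in FILTER_RE.finditer(pdf_bytes):
        names = re.findall(rb"/([A-Za-z0-9]+)", match.group(1))
        chains.append([name.decode("ascii") for name in names])
    return chains
```

A chain equal to `["DCTDecode"]` corresponds to the pure JPEG case (useA85=0); `["ASCII85Decode", "DCTDecode"]` corresponds to the reportlab default.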
B2 — synthesize and commit 5 fixtures under crates/kebab-parse-pdf/tests/fixtures/:
- F1 scanned_page1.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, Korean text).
- F2 scanned_page2.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, final consonants (batchim)).
- F4 mojibake.pdf — DejaVu TTF + ToUnicode CMap stripped (count=0);
Noto CJK TTC has PostScript outlines unsupported by reportlab.
- F6 flate_raw.pdf — /Filter /FlateDecode, DCTDecode absent (skip path input).
- F7 ccitt.pdf — /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input).
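
A rough sketch of how the JPEG magic noted for F1/F2 could be checked, assuming each image dictionary is uncompressed, contains no nested `>>` after the /DCTDecode name, and is directly followed by the `stream` keyword. `dct_stream_magics` is a hypothetical helper, not part of the committed scripts:

```python
import re

# Heuristic: a dictionary mentioning /DCTDecode, its closing ">>", then "stream" + EOL.
DCT_STREAM_RE = re.compile(rb"/DCTDecode[^>]*>>\s*stream\r?\n")


def dct_stream_magics(pdf_bytes: bytes) -> list[str]:
    """Return the first four stream bytes (as hex) of every DCTDecode stream found."""
    return [
        pdf_bytes[m.end():m.end() + 4].hex()
        for m in DCT_STREAM_RE.finditer(pdf_bytes)
    ]
```

For F1/F2 each entry would be expected to read `ffd8ffe0`; for F6/F7 the list would be empty because DCTDecode is absent. An ASCII85-wrapped stream would show ASCII text here instead of the JPEG magic.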
Synth scripts under tests/fixtures/_synth/:
- scanned_pdf.py — F1/F2 reportlab drawImage + JPEG passthrough (useA85=0).
- mojibake.py — F4 reportlab DejaVu TTF + ToUnicode strip.
- flate_ccittfax.sh — F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert.
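
For illustration, the "manual zlib PDF" idea behind F6 can be sketched in Python as below. This is not the content of flate_ccittfax.sh (which is not shown here); `build_flate_image_pdf` is a hypothetical function that assumes a raw 8-bit RGB buffer as input:

```python
import zlib


def build_flate_image_pdf(width: int, height: int, rgb: bytes) -> bytes:
    """Assemble a one-page PDF whose only image XObject uses /Filter /FlateDecode."""
    if len(rgb) != width * height * 3:
        raise ValueError("rgb must hold width * height * 3 bytes")
    image_data = zlib.compress(rgb)
    # Content stream: scale the unit image square to the page and paint it.
    content = b"q %d 0 0 %d 0 0 cm /Im0 Do Q" % (width, height)
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %d %d] "
        b"/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>"
        % (width, height),
        b"<< /Type /XObject /Subtype /Image /Width %d /Height %d "
        b"/ColorSpace /DeviceRGB /BitsPerComponent 8 /Filter /FlateDecode "
        b"/Length %d >>\nstream\n" % (width, height, len(image_data))
        + image_data + b"\nendstream",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1,
        xref_pos,
    )
    return bytes(out)
```

Writing the dictionaries by hand is what guarantees that the only filter on the image is FlateDecode, which the Pillow and reportlab paths above do not give.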
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2)
contract: §9 (additive minor wire bump — follow-up step)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
37 lines
1.2 KiB
Python
"""Synthesize DCTDecode JPEG-wrapped PDF from PNG via reportlab drawImage.

reportlab's drawImage(<jpg_filename>) preserves JPEG bytes verbatim into the
PDF stream as /Filter /DCTDecode -- exactly what F1/F2 need.

Usage:
    python3 tests/fixtures/_synth/scanned_pdf.py \
        /build/cache/pdf-ocr-poc/images/page1-clean.png \
        crates/kebab-parse-pdf/tests/fixtures/scanned_page1.pdf
"""
import sys, tempfile, os
import reportlab.rl_config
reportlab.rl_config.useA85 = 0  # disable ASCII85 wrapper so image XObject uses /Filter /DCTDecode directly
from pathlib import Path
from PIL import Image
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

src = Path(sys.argv[1])
dst = Path(sys.argv[2])

# Step 1: PNG -> JPEG (quality 85, reproducible)
img = Image.open(src).convert("RGB")
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tf:
    img.save(tf.name, "JPEG", quality=85)
    jpg_path = tf.name

# Step 2: reportlab canvas with drawImage(<jpg path>) -> DCTDecode passthrough
W, H = A4
c = canvas.Canvas(str(dst), pagesize=A4)
c.drawImage(jpg_path, 0, 0, width=W, height=H, preserveAspectRatio=True)
c.showPage()
c.save()

os.unlink(jpg_path)
print(f"wrote {dst} ({dst.stat().st_size} bytes)")