Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.
B1 — lopdf /Filter probe (Python re + shell grep on synthesized fixtures,
result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md).
Key findings:
- reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ];
useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0.
- Pillow RGB.save('.pdf','PDF') uses DCTDecode — F6 FlateDecode requires
manual PDF construction via zlib.compress.
- ghostscript pdfwrite rejects TIFF input (/undefined in II*) —
ImageMagick `convert -compress Group4` used for F7 CCITTFax.
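The B1 probe can be approximated in a few lines of Python. This is a minimal sketch, not the probe that was actually run: `probe_filters` and `is_pure_dct` are hypothetical names, and the regex only handles the two /Filter shapes listed above (a bare name or a bracketed array).

```python
import re

# Matches "/Filter /Name" or "/Filter [ /A /B ]" in raw PDF bytes.
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
NAME_RE = re.compile(rb"/([A-Za-z0-9]+)")


def probe_filters(data: bytes) -> list[list[str]]:
    """Return the filter chain of every /Filter entry found in `data`."""
    chains = []
    for m in FILTER_RE.finditer(data):
        chains.append([n.decode("ascii") for n in NAME_RE.findall(m.group(1))])
    return chains


def is_pure_dct(chain: list[str]) -> bool:
    """True when the stream is raw JPEG with no extra encoding layer."""
    return chain == ["DCTDecode"]


if __name__ == "__main__":
    sample = (
        b"<< /Filter [ /ASCII85Decode /DCTDecode ] >>\n"
        b"<< /Filter [ /DCTDecode ] >>\n"
        b"<< /Filter /FlateDecode >>\n"
    )
    for chain in probe_filters(sample):
        print(chain, "pure DCT" if is_pure_dct(chain) else "other")
```

Only the second sample entry counts as pure DCT, which mirrors the useA85=0 finding: the ASCII85 wrapper has to be absent for the stream bytes to start with the JPEG magic.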
B2 — synthesize and commit 5 fixtures under crates/kebab-parse-pdf/tests/fixtures/:
- F1 scanned_page1.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, Korean).
- F2 scanned_page2.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, Korean final consonants (batchim)).
- F4 mojibake.pdf — DejaVu TTF + ToUnicode CMap stripped (count=0);
Noto CJK TTC has PostScript outlines unsupported by reportlab.
- F6 flate_raw.pdf — /Filter /FlateDecode, DCTDecode absent (skip path input).
- F7 ccitt.pdf — /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input).
Synth scripts under tests/fixtures/_synth/:
- scanned_pdf.py — F1/F2 reportlab drawImage + JPEG passthrough (useA85=0).
- mojibake.py — F4 reportlab DejaVu TTF + ToUnicode strip.
- flate_ccittfax.sh — F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert.
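The F6 note says FlateDecode requires manual PDF construction via zlib.compress. A minimal sketch of that idea follows. It is not the contents of flate_ccittfax.sh: `build_flate_pdf` and the 8-bit grayscale image layout are assumptions made for illustration.

```python
import zlib


def build_flate_pdf(width: int, height: int, gray: bytes) -> bytes:
    """Build a one-page PDF whose only image uses /Filter /FlateDecode."""
    assert len(gray) == width * height, "expect 8-bit grayscale samples"
    img = zlib.compress(gray)
    # Content stream: scale the unit square to the image size and paint it.
    content = b"q %d 0 0 %d 0 0 cm /Im0 Do Q" % (width, height)

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %d %d] "
        b"/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>"
        % (width, height),
        b"<< /Type /XObject /Subtype /Image /Width %d /Height %d "
        b"/ColorSpace /DeviceGray /BitsPerComponent 8 "
        b"/Filter /FlateDecode /Length %d >>\nstream\n" % (width, height, len(img))
        + img + b"\nendstream",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1, xref_pos)
    return bytes(out)


if __name__ == "__main__":
    pdf = build_flate_pdf(4, 2, bytes(range(8)))
    print(len(pdf), b"/FlateDecode" in pdf, b"/DCTDecode" in pdf)
```

Writing the objects by hand keeps the image stream's filter under direct control, which is the property F6 needs as a skip-path input.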
spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1)
plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2)
contract: §9 (additive minor wire bump, follow-up step)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
49 lines
1.7 KiB
Python
"""Synthesize mojibake fixture -- Type 0 font PDF without ToUnicode CMap.

Strategy:
  1. Use reportlab to synthesize a PDF with a Type 0 (CID) font, including a
     normal ToUnicode CMap (the original plan called for Korean text; see the
     font note below).
  2. Remove the `/ToUnicode <ref>` entry from the generated PDF byte stream.

Usage:
  python3 tests/fixtures/_synth/mojibake.py \
    crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
"""
import re
import sys
from pathlib import Path

from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas

# Noto CJK TTC uses PostScript outlines which reportlab does not support.
# Use DejaVu Sans TTF (always available on Ubuntu) instead -- the fixture's
# invariant is /ToUnicode CMap absent, not a specific script.
DEJAVU_TTF = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"
FONT_NAME = "DejaVuSans"
pdfmetrics.registerFont(TTFont(FONT_NAME, DEJAVU_TTF))

dst = Path(sys.argv[1])

# Step 1: synthesize a normal PDF (ToUnicode CMap still present)
c = canvas.Canvas(str(dst), pagesize=A4)
c.setFont(FONT_NAME, 12)
y = A4[1] - 30*mm
for line in ["Mojibake fixture (no ToUnicode CMap)", "Text extraction yields garbage \x00\x01\x02"]:
    c.drawString(30*mm, y, line)
    y -= 16

c.save()

# Step 2: strip the ToUnicode CMap reference (best-effort byte-level rewrite)
data = dst.read_bytes()
# pattern: "/ToUnicode <objref>" -- only the reference is removed; the CMap
# stream object it pointed to stays in the file, unreferenced.
new_data = re.sub(rb"/ToUnicode\s+\d+\s+\d+\s+R\b", b"", data)

if new_data == data:
    print("WARNING: /ToUnicode reference not found -- Tier 1 failed, try Tier 2", file=sys.stderr)
    sys.exit(2)

dst.write_bytes(new_data)
print(f"wrote {dst} ({dst.stat().st_size} bytes, ToUnicode stripped)")
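Deleting the `/ToUnicode N G R` bytes outright shortens the file, so byte offsets recorded in the cross-reference table for later objects no longer match. One possible variant, sketched here as an assumption rather than what the fixture script does, overwrites the reference with spaces of equal length so the file size and all offsets stay unchanged. `blank_tounicode` is a hypothetical helper name.

```python
import re

# Same pattern as the fixture script: "/ToUnicode <obj> <gen> R".
TOUNICODE_REF = re.compile(rb"/ToUnicode\s+\d+\s+\d+\s+R\b")


def blank_tounicode(data: bytes) -> tuple[bytes, int]:
    """Overwrite each /ToUnicode reference with spaces.

    Returns (new_data, number_of_references_blanked). The output has the
    same length as the input, so existing xref offsets remain valid.
    """
    return TOUNICODE_REF.subn(lambda m: b" " * len(m.group(0)), data)


if __name__ == "__main__":
    sample = b"<< /Type /Font /ToUnicode 12 0 R /BaseFont /X >>"
    stripped, hits = blank_tounicode(sample)
    print(hits, len(stripped) == len(sample), stripped.count(b"/ToUnicode"))
```

The returned hit count can stand in for the `new_data == data` check, and counting `/ToUnicode` in the result is a direct way to confirm the count=0 invariant noted for F4.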