kebab/tests/fixtures/_synth/mojibake.py
altair823 aeeff3635b poc+test(pdf-ocr): lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7) for v0.20 sub-item 1
Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

B1 — lopdf /Filter probe (Python re + shell grep on synthesized fixtures,
result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md).

Key findings:
- reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ];
  useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0.
- Pillow RGB.save('.pdf','PDF') uses DCTDecode — F6 FlateDecode requires
  manual PDF construction via zlib.compress.
- ghostscript pdfwrite rejects TIFF input (/undefined in II*) —
  ImageMagick `convert -compress Group4` used for F7 CCITTFax.
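
The "Python re" half of the B1 probe could look roughly like the sketch below. It is an illustration only, not the script used for the PoC: the names `probe_filters` and `is_pure_dct` are hypothetical, and the scan works on raw bytes, so it cannot see dictionaries stored inside compressed object streams and may in principle match a stray `/Filter` inside binary stream data.

```python
import re

# Matches "/Filter /Name" or "/Filter [ /A /B ]" in raw PDF bytes.
_FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
_NAME_RE = re.compile(rb"/([A-Za-z0-9]+)")


def probe_filters(data: bytes) -> list[list[str]]:
    """Return the filter chain of every /Filter entry, in file order."""
    chains = []
    for match in _FILTER_RE.finditer(data):
        names = _NAME_RE.findall(match.group(1))
        chains.append([name.decode("ascii") for name in names])
    return chains


def is_pure_dct(chain: list[str]) -> bool:
    """True only when the stream is a bare JPEG with no ASCII85 wrapper."""
    return chain == ["DCTDecode"]
```

With this shape, the reportlab default output would report `["ASCII85Decode", "DCTDecode"]`, which `is_pure_dct` rejects, while the `useA85=0` output would report `["DCTDecode"]`.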

B2 — synthesize and commit 5 fixtures under crates/kebab-parse-pdf/tests/fixtures/:
- F1 scanned_page1.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, Korean text).
- F2 scanned_page2.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, Korean final consonants, i.e. batchim).
- F4 mojibake.pdf       — DejaVu TTF + ToUnicode CMap stripped (count=0);
                          Noto CJK TTC has PostScript outlines unsupported by reportlab.
- F6 flate_raw.pdf      — /Filter /FlateDecode, DCTDecode absent (skip path input).
- F7 ccitt.pdf          — /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input).
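
For F6, "manual PDF construction via zlib.compress" can be pictured with the following minimal sketch. It is not the contents of flate_ccittfax.sh; the function name `build_flate_pdf` and the object layout (five objects, image named `Im0`) are assumptions chosen for illustration. It writes raw RGB samples as a single FlateDecode image XObject and computes the xref offsets by hand.

```python
import zlib


def build_flate_pdf(width: int, height: int, rgb: bytes) -> bytes:
    """Build a one-page PDF whose only image uses /Filter /FlateDecode."""
    if len(rgb) != width * height * 3:
        raise ValueError("rgb must hold width*height*3 bytes")
    image = zlib.compress(rgb)
    # Content stream: scale the unit square to the page and paint the image.
    content = f"q {width} 0 0 {height} 0 0 cm /Im0 Do Q".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width} {height}] "
         f"/Resources << /XObject << /Im0 4 0 R >> >> "
         f"/Contents 5 0 R >>").encode("ascii"),
        (f"<< /Type /XObject /Subtype /Image /Width {width} /Height {height} "
         f"/ColorSpace /DeviceRGB /BitsPerComponent 8 "
         f"/Filter /FlateDecode /Length {len(image)} >>\nstream\n").encode("ascii")
        + image + b"\nendstream",
        f"<< /Length {len(content)} >>\nstream\n".encode("ascii")
        + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += f"{num} 0 obj\n".encode("ascii") + body + b"\nendobj\n"
    xref_at = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode("ascii")
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += f"{off:010d} 00000 n \n".encode("ascii")
    out += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
            f"startxref\n{xref_at}\n%%EOF\n").encode("ascii")
    return bytes(out)
```

Because no JPEG encoder is involved at any point, the result cannot contain DCTDecode, which is the property the skip-path fixture relies on.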

Synth scripts under tests/fixtures/_synth/:
- scanned_pdf.py    — F1/F2 reportlab drawImage + JPEG passthrough (useA85=0).
- mojibake.py       — F4 reportlab DejaVu TTF + ToUnicode strip.
- flate_ccittfax.sh — F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert.

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2)
contract: §9 (additive minor wire bump — follow-up step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 04:04:47 +00:00

"""Synthesize mojibake fixture -- Type 0 font PDF without ToUnicode CMap.

Strategy:
1. Use reportlab to synthesize a Korean PDF with a Type 0 (CID) font
   (a normal ToUnicode CMap is included at this point).
2. Remove the `/ToUnicode <ref>` entry from the generated PDF byte stream.
   The referenced CMap stream object is left in the file, unreferenced.

Usage:
    python3 tests/fixtures/_synth/mojibake.py \
        crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
"""
import re
import sys
from pathlib import Path
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas
# Noto CJK TTC uses PostScript outlines which reportlab does not support.
# Use DejaVu Sans TTF (always available on Ubuntu) instead -- the fixture's
# invariant is /ToUnicode CMap absent, not a specific script.
DEJAVU_TTF = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"
FONT_NAME = "DejaVuSans"
pdfmetrics.registerFont(TTFont(FONT_NAME, DEJAVU_TTF))
dst = Path(sys.argv[1])
# Step 1: synthesize a normal PDF
c = canvas.Canvas(str(dst), pagesize=A4)
c.setFont(FONT_NAME, 12)
y = A4[1] - 30*mm
for line in ["Mojibake fixture (no ToUnicode CMap)", "Text extraction yields garbage \x00\x01\x02"]:
    c.drawString(30*mm, y, line)
    y -= 16
c.save()
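# Step 2: strip the ToUnicode CMap reference (best-effort byte-level rewrite)
data = dst.read_bytes()
# Pattern: "/ToUnicode <objref>". Only the reference is blanked; the CMap stream
# object it pointed to stays in the file as an unreferenced orphan.
# The match is overwritten with spaces of the same length so that the byte
# offsets recorded in the xref table and startxref remain valid.
new_data = re.sub(rb"/ToUnicode\s+\d+\s+\d+\s+R\b", lambda m: b" " * len(m.group(0)), data)
if new_data == data:
    print("WARNING: /ToUnicode reference not found -- Tier 1 failed, try Tier 2", file=sys.stderr)
    sys.exit(2)
dst.write_bytes(new_data)
print(f"wrote {dst} ({dst.stat().st_size} bytes, ToUnicode stripped)")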
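
The fixture's invariant (no `/ToUnicode` reference, i.e. count=0) can be checked after generation with a small helper such as the one sketched here. The names `count_tounicode_refs` and `assert_stripped` are hypothetical and not part of the repository; the check reuses the same byte-level pattern as the script, so it shares its limitation of not looking inside compressed object streams.

```python
import re
from pathlib import Path

# Same pattern the synth script uses to locate the reference.
TOUNICODE_REF = re.compile(rb"/ToUnicode\s+\d+\s+\d+\s+R\b")


def count_tounicode_refs(data: bytes) -> int:
    """Count indirect /ToUnicode references in raw PDF bytes."""
    return len(TOUNICODE_REF.findall(data))


def assert_stripped(path: str) -> None:
    """Raise AssertionError if the PDF at `path` still references a CMap."""
    found = count_tounicode_refs(Path(path).read_bytes())
    assert found == 0, f"{path}: {found} /ToUnicode reference(s) remain"
```

A call such as `assert_stripped("crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf")` would then guard against regenerating the fixture with the CMap still attached.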