kebab/tests/fixtures/_synth/flate_ccittfax.sh
altair823 aeeff3635b poc+test(pdf-ocr): lopdf /Filter probe + 5 fixture commit (F1/F2/F4/F6/F7) for v0.20 sub-item 1
Step 2 (Group B) of v0.20.0 sub-item 1 (scanned PDF OCR) plan.

B1 — lopdf /Filter probe (Python re + shell grep on synthesized fixtures,
result appended to docs/superpowers/poc/2026-05-27-pdf-ocr-engine-comparison.md).

Key findings:
- reportlab default (useA85=1) yields /Filter [ /ASCII85Decode /DCTDecode ];
  useA85=0 gives pure /Filter [ /DCTDecode ] with JPEG magic ffd8ffe0.
- Pillow RGB.save('.pdf','PDF') uses DCTDecode — F6 FlateDecode requires
  manual PDF construction via zlib.compress.
- ghostscript pdfwrite rejects TIFF input (/undefined in II*) —
  ImageMagick `convert -compress Group4` used for F7 CCITTFax.
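
The probe behind these findings can be approximated with a short regex scan over raw PDF bytes. This is a minimal sketch, not the probe recorded in the PoC document: `probe_filters` and `FILTER_RE` are hypothetical names, and the scan only sees dictionaries stored uncompressed, so filters inside compressed object streams would be missed.

```python
import re

# Hypothetical helper (not from the repository): match either a single
# filter name or a bracketed filter array after /Filter.
FILTER_RE = re.compile(rb'/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)')

def probe_filters(pdf_bytes: bytes) -> list[list[str]]:
    """Return one list of filter names per /Filter entry found."""
    chains = []
    for match in FILTER_RE.finditer(pdf_bytes):
        names = re.findall(rb'/([A-Za-z0-9]+)', match.group(1))
        chains.append([name.decode('ascii') for name in names])
    return chains

# Two dictionaries shaped like the reportlab and manual-zlib cases.
sample = (
    b'<< /Filter [ /ASCII85Decode /DCTDecode ] /Length 10 >>\n'
    b'<< /Filter /FlateDecode /Length 20 >>\n'
)
print(probe_filters(sample))
# [['ASCII85Decode', 'DCTDecode'], ['FlateDecode']]
```

Normalizing the single-name and array forms into lists makes the useA85=1 and useA85=0 outputs directly comparable.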

B2 — synthesize and commit 5 fixtures under crates/kebab-parse-pdf/tests/fixtures/:
- F1 scanned_page1.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page1-clean.png, Korean text).
- F2 scanned_page2.pdf — /Filter [ /DCTDecode ], JPEG magic ffd8ffe0 (page2-clean.png, batchim, i.e. final consonants).
- F4 mojibake.pdf       — DejaVu TTF + ToUnicode CMap stripped (count=0);
                          Noto CJK TTC has PostScript outlines unsupported by reportlab.
- F6 flate_raw.pdf      — /Filter /FlateDecode, DCTDecode absent (skip path input).
- F7 ccitt.pdf          — /Filter [ /CCITTFaxDecode ], DCTDecode absent (skip path input).
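
How the fixtures split between the OCR path and the skip path can be illustrated as follows. This is only a sketch under an assumption: it treats a pure DCTDecode stream with a JPEG header as OCR input and everything else as skip input, which is how the fixture notes read. The actual rule is defined by the spec, and `route_image` and `JPEG_SOI` are hypothetical names.

```python
# Hypothetical routing rule inferred from the fixture notes; the real rule
# lives in the spec, not in this sketch.
JPEG_SOI = bytes.fromhex('ffd8ff')  # JPEG start-of-image plus next marker prefix

def route_image(filters, stream_head):
    """Return 'ocr' for a pure DCTDecode stream that starts like a JPEG, else 'skip'."""
    if filters == ['DCTDecode'] and stream_head.startswith(JPEG_SOI):
        return 'ocr'
    return 'skip'

# Filter chains as listed for the fixtures; stream heads are illustrative.
cases = {
    'scanned_page1.pdf': (['DCTDecode'], bytes.fromhex('ffd8ffe0')),
    'flate_raw.pdf': (['FlateDecode'], b''),
    'ccitt.pdf': (['CCITTFaxDecode'], b''),
}
for name, (filters, head) in cases.items():
    print(name, route_image(filters, head))
```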

Synth scripts under tests/fixtures/_synth/:
- scanned_pdf.py    — F1/F2 reportlab drawImage + JPEG passthrough (useA85=0).
- mojibake.py       — F4 reportlab DejaVu TTF + ToUnicode strip.
- flate_ccittfax.sh — F6 manual zlib PDF + F7 Pillow TIFF group4 + ImageMagick convert.
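
The "manual zlib PDF" step for F6 comes down to compressing raw RGB pixels and embedding the result as the image stream. A minimal sketch of that payload and a round-trip check, assuming the same all-white 300x200 image the script uses (`flate_image_stream` is a hypothetical name):

```python
import zlib

def flate_image_stream(width, height):
    """Return (raw, compressed) bytes for an all-white 8-bit RGB image."""
    raw = b'\xff\xff\xff' * (width * height)
    return raw, zlib.compress(raw, 6)

raw, compressed = flate_image_stream(300, 200)

# A FlateDecode consumer must recover exactly width * height * 3 bytes.
assert zlib.decompress(compressed) == raw
print(len(raw), 'raw bytes,', len(compressed), 'compressed bytes')
```

The `/Length` entry of the image XObject must be the compressed length, while `/Width`, `/Height`, `/ColorSpace` and `/BitsPerComponent` describe the decompressed data.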

spec:  docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md (§5.1)
plan:  docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md (Step 2 B1+B2)
contract: §9 (additive minor wire bump — follow-up step)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 04:04:47 +00:00


#!/usr/bin/env bash
# Synthesize F6 (FlateDecode) and F7 (CCITTFaxDecode) fixtures.
#
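# F6: minimal PDF written by hand with a zlib-compressed (FlateDecode) raw pixel stream;
#     Pillow's RGB PDF output uses DCTDecode, so it cannot produce this fixture.
# F7: Pillow bilevel TIFF (group4) converted to a CCITTFaxDecode PDF.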
#
# Usage (from repo root):
# bash tests/fixtures/_synth/flate_ccittfax.sh
set -euo pipefail
FIXTURES="crates/kebab-parse-pdf/tests/fixtures"
# --- F6: FlateDecode raw pixel ---
# Pillow RGB->PDF uses DCTDecode by default; write a minimal PDF manually.
python3 -c "
import zlib

def make_flatedecode_pdf(dst_path, width=300, height=200):
    raw = b'\xff\xff\xff' * width * height
    compressed = zlib.compress(raw, level=6)
    content = f'q {width} 0 0 {height} 0 0 cm /Im Do Q'.encode()
    buf = b'%PDF-1.4\n'
    offsets = {}
    offsets[1] = len(buf)
    buf += b'1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n'
    offsets[2] = len(buf)
    buf += b'2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n'
    offsets[3] = len(buf)
    buf += (f'3 0 obj\n<< /Type /Page /Parent 2 0 R '
            f'/MediaBox [0 0 {width} {height}] '
            f'/Contents 4 0 R '
            f'/Resources << /XObject << /Im 5 0 R >> >> >>\nendobj\n').encode()
    offsets[4] = len(buf)
    buf += f'4 0 obj\n<< /Length {len(content)} >>\nstream\n'.encode()
    buf += content + b'\nendstream\nendobj\n'
    offsets[5] = len(buf)
    hdr = (f'5 0 obj\n<< /Type /XObject /Subtype /Image '
           f'/Width {width} /Height {height} '
           f'/ColorSpace /DeviceRGB /BitsPerComponent 8 '
           f'/Filter /FlateDecode '
           f'/Length {len(compressed)} >>\nstream\n').encode()
    buf += hdr + compressed + b'\nendstream\nendobj\n'
    xref_offset = len(buf)
    buf += b'xref\n0 6\n0000000000 65535 f \n'
    for i in range(1, 6):
        buf += f'{offsets[i]:010d} 00000 n \n'.encode()
    buf += (f'trailer\n<< /Size 6 /Root 1 0 R >>\nstartxref\n{xref_offset}\n%%EOF\n').encode()
    with open(dst_path, 'wb') as f:
        f.write(buf)
    print(f'F6 wrote {dst_path} ({len(buf)} bytes)')

make_flatedecode_pdf('${FIXTURES}/flate_raw.pdf')
"
echo "F6 verify:"
python3 -c "
import re
data = open('${FIXTURES}/flate_raw.pdf', 'rb').read()
filters = re.findall(rb'/Filter\s*[\[/][^\]\n]{0,40}', data)
print(' Filters:', [f.decode(errors='replace') for f in filters])
"
# --- F7: CCITTFaxDecode bilevel ---
python3 -c "
from PIL import Image, ImageDraw, ImageFont
im = Image.new('1', (600, 800), 1)
draw = ImageDraw.Draw(im)
try:
    font = ImageFont.truetype('/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc', 16, index=1)
    draw.text((50, 50), 'test ccittfax', fill=0, font=font)
except Exception:
    draw.text((50, 50), 'test ccittfax', fill=0)
im.save('/tmp/ccitt.tif', 'TIFF', compression='group4')
print('TIFF wrote')
"
# ghostscript pdfwrite rejects TIFF input (/undefined in II*), so ImageMagick
# does the TIFF -> PDF conversion and keeps the Group4 (CCITTFax) encoding.
convert /tmp/ccitt.tif -compress Group4 "${FIXTURES}/ccitt.pdf"
rm -f /tmp/ccitt.tif
echo "F7 wrote: ${FIXTURES}/ccitt.pdf ($(stat -c%s "${FIXTURES}/ccitt.pdf") bytes)"
echo "F7 verify:"
if grep -qa "/CCITTFax" "${FIXTURES}/ccitt.pdf" 2>/dev/null; then
  echo "  CCITTFax found (OK)"
else
  echo "  WARNING: CCITTFax not found -- ImageMagick may have re-encoded"
  grep -ao "/Filter[ /][A-Za-z]*" "${FIXTURES}/ccitt.pdf" | sort -u || true
fi