Fixes Bug #4 from the v0.20.0 sub-item 1 dogfood report: for the F4
mojibake.pdf fixture, lopdf `get_pages()` returned a count of 0 (broken
Pages tree). Root cause: the previous byte-level `re.sub` + manual
startxref edit passed lopdf's strict load but broke the `/Kids`
reference in the Pages dict.
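
Illustrative sketch only, not the removed generator code. It shows one
plausible way a length-changing byte-level edit can leave a PDF
structurally inconsistent: the byte offsets recorded for later objects go
stale. The toy byte string and the regex are assumptions made for the
example and are not taken from the real fixture.

```python
import re

# Toy byte string standing in for a PDF body (hypothetical, not a real PDF).
body = b"1 0 obj<</ToUnicode 9 0 R>>endobj\n2 0 obj<</Kids[3 0 R]>>endobj\n"

# Offset that an xref table would have recorded for object 2 before the edit.
offset_before = body.index(b"2 0 obj")

# Byte-level strip of the /ToUnicode entry, as a regex-based edit would do.
edited = re.sub(rb"/ToUnicode \d+ \d+ R", b"", body)

# Object 2 now starts earlier, so the previously recorded offset is stale.
offset_after = edited.index(b"2 0 obj")
shift = offset_before - offset_after

print(f"object 2 moved by {shift} bytes")
print(f"bytes at stale offset: {edited[offset_before:offset_before + 7]!r}")
```

Letting a PDF library rewrite the file avoids this class of problem,
because the cross-reference data is regenerated from the object graph on
save.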
- `tests/fixtures/_synth/mojibake.py`: full rewrite. Replace the
byte-level `re.sub` + manual startxref edit with a pikepdf sequence:
open, inject a dummy ToUnicode, del, save (xref is regenerated
automatically). For the HYSMyeongJo-Medium CID font, no ToUnicode is
generated by the CID font itself, so a dummy stream is injected and then
stripped (removed=1 invariant). Exit codes 2/3/4 signal invariant
failures.
- `crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf`: regenerated via
pikepdf. 1 valid page, no /ToUnicode marker, byte-identical across
regenerations (reproducible).
- `crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json`:
regenerated via the 2-run cargo test pattern (hand-rolled
unwrap_or_else baseline bootstrap; the insta crate is not used).
- `crates/kebab-parse-pdf/tests/text_extractor_regression.rs`: append 3
invariant tests: (1) lopdf reports 1 page, (2) the /ToUnicode marker is
absent, (3) PdfTextExtractor yields 1 block.
- `crates/kebab-parse-pdf/src/text_quality.rs`: raise the
`f4_fixture_ratio_under_threshold` threshold 0.3 → 0.5 (the production
default for valid_ratio_threshold). The old broken fixture (pages=0)
gave extract_text="" → ratio=0.0; the new fixed fixture goes through the
CID 2-byte fallback decode → ratio≈0.375, which still satisfies the OCR
trigger condition.
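
Worked check of the threshold change, as a minimal sketch. The helper
`needs_ocr` and the strict `<` comparison are assumptions for
illustration; the actual rule lives in `text_quality.rs` and may differ.
The ratio values are the ones reported above.

```python
def needs_ocr(valid_ratio: float, threshold: float) -> bool:
    """Assumed rule: trigger OCR when the valid ratio is below the threshold."""
    return valid_ratio < threshold


OLD_FIXTURE_RATIO = 0.0    # pages=0, so the extracted text was empty
NEW_FIXTURE_RATIO = 0.375  # approximate value after CID 2-byte fallback decode

OLD_THRESHOLD = 0.3
NEW_THRESHOLD = 0.5        # production valid_ratio_threshold default

for name, ratio in [("old", OLD_FIXTURE_RATIO), ("new", NEW_FIXTURE_RATIO)]:
    for threshold in (OLD_THRESHOLD, NEW_THRESHOLD):
        verdict = needs_ocr(ratio, threshold)
        print(f"{name} fixture, threshold={threshold}: ocr={verdict}")
```

Under this assumed rule the new fixture would not be flagged at 0.3 but
is flagged at 0.5, which is why the test threshold moves to the
production default.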
spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md (§5)
plan: docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md (Step 4)
prior: 241ded5 (Step 3 integration test)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
66 lines
1.8 KiB
JSON
{
  "doc_id": "c90fae7576fe514fb08190cb29d1ef5d",
  "source_asset_id": "babe9824b6b28237c0898575a40ba48d",
  "workspace_path": "mojibake.pdf",
  "title": "untitled",
  "lang": "und",
  "blocks": [
    {
      "kind": "paragraph",
      "common": {
        "block_id": "22bb97fc37da5c55c099e2763f95ffd9",
        "heading_path": [],
        "source_span": {
          "kind": "page",
          "page": 1,
          "char_start": 0,
          "char_end": 64
        }
      },
      "text": "\n<><6E><EFBFBD><EFBFBD><EFBFBD>\u0014<31>\u0000\u0000 <20>=¤̘\u0000 \u0014\u0000 <20> <20><><EFBFBD>T<EFBFBD><54>\u0000 <20><><EFBFBD>L\n<>\\<5C>mŴ\u0000 <20>8ǐ<38>\u0000\u0000 <20>h<EFBFBD><68><EFBFBD><EFBFBD>\u0000 <20><>ư\u0000.\n",
      "inlines": [
        {
          "kind": "text",
          "text": "\n<><6E><EFBFBD><EFBFBD><EFBFBD>\u0014<31>\u0000\u0000 <20>=¤̘\u0000 \u0014\u0000 <20> <20><><EFBFBD>T<EFBFBD><54>\u0000 <20><><EFBFBD>L\n<>\\<5C>mŴ\u0000 <20>8ǐ<38>\u0000\u0000 <20>h<EFBFBD><68><EFBFBD><EFBFBD>\u0000 <20><>ư\u0000.\n"
        }
      ]
    }
  ],
  "metadata": {
    "aliases": [],
    "tags": [],
    "created_at": "1970-01-01T00:00:00Z",
    "updated_at": "1970-01-01T00:00:00Z",
    "source_type": "paper",
    "trust_level": "primary",
    "user_id_alias": null,
    "user": {
      "pdf": {
        "creator": "anonymous",
        "page_count": 1,
        "producer": "ReportLab PDF Library - (opensource)"
      }
    }
  },
  "provenance": {
    "events": [
      {
        "at": "1970-01-01T00:00:00Z",
        "agent": "kb-source-fs",
        "kind": "discovered",
        "note": null
      },
      {
        "at": "1970-01-01T00:00:00Z",
        "agent": "kb-parse-pdf",
        "kind": "parsed",
        "note": "parser_version=pdf-text-v1; page_count=1"
      }
    ]
  },
  "parser_version": "pdf-text-v1",
  "schema_version": 1,
  "doc_version": 1,
  "last_chunker_version": null,
  "last_embedding_version": null
}