---
title: "v0.20.0 sub-item 1 bugfix round 2: Identity-H mojibake marker + CLI --media help text"
created: 2026-05-27
status: "DRAFT round 1c"
parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
contract_sections: ["§1.3 (text-detect threshold metric)", "§9 (version cascade)"]
related_specs:
  - docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
  - docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
related_dogfood:
  - .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md (Bug #6 + #7)
---

# v0.20.0 sub-item 1 bugfix round 2: Identity-H mojibake marker + CLI --media help text

## §1 Problem statement

### §1.1 Bug #6: Identity-H Unimplemented marker bypasses mojibake detection

**Symptom**: `metro-korea.pdf` (58 MB, Identity-H CID font without ToUnicode CMap) ingests with `pdf_ocr_pages=0`. The full text contains the `?Identity-H Unimplemented?` marker 1154 times. All 21 pages and 34 chunks are indexable, but the content is unusable: a repeated marker literal instead of readable text.

**Root cause**: `crates/kebab-parse-pdf/src/text_quality.rs` lines 9-37. `compute_valid_char_ratio()` delegates to `is_valid_text_char()`, which treats the ASCII printable range `0x0020..=0x007E` as unconditionally valid. lopdf emits `?Identity-H Unimplemented?` (26 ASCII printable chars) when it cannot decode a CID font lacking a ToUnicode CMap. Result: valid_ratio = 1.0 → exceeds the OCR fallback threshold of 0.5 → the text-detect first pass classifies mojibake as valid text → `pdf_ocr_pages` stays 0 and no OCR fallback is triggered.

**Design intent deviation**: Parent spec §1.3 (line 74) explicitly states "ratio metric judges mojibake page as scanned candidate." The PoC example "֥ᬵᯝ₞e ࠦᯱᖝ░" (custom font, no ToUnicode) should trigger OCR.

**Implementation gap**: The literal ASCII marker case (Identity-H font) was not anticipated.
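
The bypass can be reproduced in isolation. The sketch below is a reduced stand-in for the pre-fix logic, not the code in `text_quality.rs`: `is_valid_text_char_sketch` and `ratio_before_fix` are hypothetical names, and only the ASCII-printable and Hangul-syllable ranges are modeled (the real function covers more categories).

```rust
/// Hypothetical, reduced stand-in for `is_valid_text_char()`:
/// models only ASCII printable chars and Hangul syllables.
fn is_valid_text_char_sketch(c: char) -> bool {
    matches!(c, '\u{0020}'..='\u{007E}' | '\u{AC00}'..='\u{D7A3}')
}

/// Pre-fix behavior (assumed shape): valid chars / total chars,
/// with no special handling of lopdf marker strings.
fn ratio_before_fix(s: &str) -> f32 {
    let total = s.chars().count();
    if total == 0 {
        return 0.0;
    }
    let valid = s.chars().filter(|&c| is_valid_text_char_sketch(c)).count();
    valid as f32 / total as f32
}

fn main() {
    let marker = "?Identity-H Unimplemented?";
    assert_eq!(marker.len(), 26);

    // A page made only of markers is entirely ASCII printable,
    // so the ratio is 1.0 and clears the 0.5 OCR threshold.
    let marker_page = marker.repeat(50);
    assert_eq!(ratio_before_fix(&marker_page), 1.0);

    // Private-use-area mojibake, by contrast, is already caught.
    let pua_page = "\u{E000}\u{E001}\u{E002}";
    assert_eq!(ratio_before_fix(pua_page), 0.0);

    println!("marker-only ratio = {}", ratio_before_fix(&marker_page));
}
```

The contrast between the two pages is the point: the metric handles non-ASCII garbage but has no way to distinguish a decoder's ASCII error literal from real text.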

### §1.2 Bug #7: CLI `--media` help text omits `code` from valid values

**Symptom**: `kebab search --help` lists the accepted `--media` values as "markdown, pdf, image, audio, other"; `code` is missing.

**Actual behavior**: `kebab search "main" --media code --json -k 5` returns 5 hits (code/script.sh, code/rust_sample/src/main.rs, etc.). Schema `media_breakdown` includes `code: 6` as first-class. Functional correctness is complete; **only the CLI doc-comment is outdated**.

**Root cause**: `crates/kebab-cli/src/main.rs:148-165`. The clap doc-comment on the SearchArgs `--media` field omits `code`. clap's `--help` renderer quotes this doc-comment directly.

---

## §2 Scope + non-scope

### §2.1 Included in this spec

- **Bug #6 fix**: Add known mojibake marker stripping to `compute_valid_char_ratio()`.
- **Bug #6 test**: Two new unit tests covering the Identity-H marker (marker-dominant and marker-minority mixed-text cases; see §4.2).
- **Bug #6 regression**: Verify the existing 8 text_quality unit tests remain green.
- **Bug #7 fix**: Update the CLI `--media` doc-comment to include `code`.
- **Bug #7 test**: Assert that `kebab search --help` output contains the "`code`" substring.
- **Traceability**: Link both fixes to parent spec §1.3 design intent.

### §2.2 Explicitly out of scope

**Bug #8 candidate (falsified)**: The V007 trigram tokenizer is already applied; the 2-character query limitation is a design-level constraint, not a bug. Handled in the prior dogfood report §Bug #8.

**Non-bug observations**:

- `--readonly + ingest` exit=0: Graceful refusal per CLAUDE.md intent (exit codes 0/1/2/3 unchanged; `error.v1.code` handles agent branching).
- Ask phrasing-sensitive refusal: RAG corner case; not a code defect.
- Binary staleness: Environmental artifact, not applicable to this spec.

**Ancillary risks**:

- Scan for other `--media` doc-comment locations (R-4): Plan drafter to use grep; not a blocker for this spec.
- Other lopdf unimplemented markers (R-1): Resolved for lopdf 0.32.0 (see §6 R-1); the marker array is extensible if a later lopdf version adds more.

---

## §3 Decisions

### §3.1 Bug #6: Known mojibake marker stripping

Strip known mojibake marker substrings **before ratio calculation**, then force the ratio to 0.0 if the remainder is empty after marker removal. When stripped characters exceed remaining characters (marker dominance), cap the ratio at 0.3 to trigger OCR fallback on marker-heavy mixed pages.

**Rationale**: lopdf's unimplemented CID font handling consistently emits specific ASCII marker strings. Hardcoding them is lightweight, deterministic, and covers the known failure mode without requiring expensive heuristics (e.g., ML-based gibberish detection). Pages like those in `metro-korea.pdf` may contain mostly mojibake body text with small valid headers; the marker-dominance check ensures such pages fall below the 0.5 OCR threshold.

**Marker list**: `?Identity-H Unimplemented?` only. lopdf 0.32.0 emits exactly one marker (verified per critic round 1 probe). Extensible if future lopdf versions emit additional markers.

### §3.2 Bug #7: CLI doc-comment update

Add `code` to the comma-separated list of valid `--media` values in the SearchArgs field's clap doc-comment. Single-line edit; no functional or schema changes.

### §3.3 Parent spec traceability

Both fixes uphold parent spec §1.3:

- Bug #6 ensures mojibake pages (fonts lacking a ToUnicode CMap) trigger OCR fallback per design intent.
- Bug #7 corrects CLI documentation to match the actual schema (first-class `code` media type supported since v0.18.0).

No changes to parser_version, chunker_version, or wire schema.

---

## §4 Implementation specification

### §4.1 Bug #6: text_quality.rs diff

**File**: `crates/kebab-parse-pdf/src/text_quality.rs`

**Change**:

1. Add constant array of known mojibake markers (lines 8–10):

```rust
// Source of truth: lopdf-0.32.0/src/document.rs:523 (Document::decode_text).
// Only one Unimplemented marker is emitted by lopdf 0.32.0; other CMap
// encodings fall through to `String::from_utf8_lossy(bytes)`, which yields
// PUA / replacement-char territory already covered by `pure_pua_zero`.
// Re-verify on lopdf dependency upgrade.
const MOJIBAKE_MARKERS: &[&str] = &[
    "?Identity-H Unimplemented?",
];
```

2. Refactor `compute_valid_char_ratio()` (lines 39–106):

```rust
pub fn compute_valid_char_ratio(s: &str) -> f32 {
    // 1) Strip known mojibake markers before counting valid chars.
    //    Identity-H CID fonts without ToUnicode CMap emit ASCII-only marker
    //    substrings (bypassing PUA detection).
    let mut cleaned: String = s.to_string();
    // `had_marker` guard preserves prior behavior for whitespace-only input
    // (returns ratio of whitespace validity, not 0.0) when no markers found.
    // With markers stripped, the guard enables the trim-empty check.
    let mut had_marker = false;
    for marker in MOJIBAKE_MARKERS {
        if cleaned.contains(marker) {
            had_marker = true;
            cleaned = cleaned.replace(marker, "");
        }
    }

    // 2) Whitespace-only cleaned text → 0.0 (marker-only page).
    if had_marker && cleaned.trim().is_empty() {
        return 0.0;
    }

    // 3) Marker-dominance heuristic: when stripped chars exceed remaining
    //    chars (i.e. marker > 50% of original), the page is "mostly mojibake
    //    with some decodeable page-furniture" (e.g. metro-korea.pdf has
    //    header text in a separate font + body that is Identity-H CID).
    //    Force ratio downward to trigger OCR fallback (parent spec §1.3 intent).
    if had_marker {
        // Count chars, not bytes: `str::len()` is a byte length and would
        // overweight a multi-byte remainder (e.g. Hangul at 3 bytes per char).
        let remaining_chars = cleaned.chars().count();
        let stripped_chars = s.chars().count().saturating_sub(remaining_chars);
        if stripped_chars > remaining_chars {
            // Marker dominates: cap ratio at 0.3 (below 0.5 OCR threshold).
            // The 0.3 cap (not 0.0) preserves a small signal that some text
            // WAS decodeable, useful for downstream metrics if ever exposed.
            let mut total = 0u32;
            let mut valid = 0u32;
            for c in cleaned.chars() {
                total += 1;
                if is_valid_text_char(c) {
                    valid += 1;
                }
            }
            let raw_ratio = if total == 0 { 0.0 } else { valid as f32 / total as f32 };
            return raw_ratio.min(0.3);
        }
    }

    // 4) Otherwise compute ratio on cleaned text (existing logic).
    let mut total = 0u32;
    let mut valid = 0u32;
    for c in cleaned.chars() {
        total += 1;
        if is_valid_text_char(c) {
            valid += 1;
        }
    }
    if total == 0 {
        return 0.0;
    }
    valid as f32 / total as f32
}
```

**Invariants preserved**:

- Function signature and return type unchanged (→ byte-identical caller surface).
- Existing character category logic (hangul, CJK, Latin-1) unmodified.
- Empty-string behavior (return 0.0) preserved.

### §4.2 Bug #6: Unit tests

Replace the existing Bug #6 test set with two new tests reflecting the marker-dominance heuristic:

```rust
#[test]
fn identity_h_marker_dominance_caps_ratio_below_threshold() {
    // metro-korea.pdf-class: 20× marker (520 chars) + 12-char ASCII header
    // ("Page 1 of 5" plus the separating space).
    // Without dominance heuristic: ratio = 12/12 = 1.0 (bypasses OCR).
    // With dominance heuristic: ratio ≤ 0.3 (triggers OCR fallback).
    let s = format!("Page 1 of 5 {}", "?Identity-H Unimplemented?".repeat(20));
    let r = compute_valid_char_ratio(&s);
    assert!(r <= 0.3, "marker-dominant mixed page → ratio ≤ 0.3 (OCR fallback); got {r}");
}

#[test]
fn identity_h_marker_minority_with_long_valid_text_keeps_high_ratio() {
    // Inverse case: short marker noise + long valid text → ratio stays high
    // (no false OCR trigger on otherwise-good pages).
    let header = "x".repeat(200); // 200 chars of valid ASCII
    let s = format!("{header} ?Identity-H Unimplemented?"); // 1× marker = 26 chars
    let r = compute_valid_char_ratio(&s);
    assert!(r > 0.9, "marker-minority page keeps high ratio; got {r}");
}
```

**Regression preservation**: The existing 8 tests (`empty_string_zero`, `pure_ascii_one`, `pure_hangul_syllables_one`, `pure_pua_zero`, `mixed_half`, `cjk_ideograph_valid`, `hangul_jamo_valid`, `f4_fixture_ratio_under_threshold`) must all remain green.

### §4.3 Bug #7: CLI doc-comment diff

**File**: `crates/kebab-cli/src/main.rs` (SearchArgs field, lines ~150–160)

**Change**:

```diff
-/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `other`. Unknown values match nothing
+/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `code`, `other`. Unknown values match nothing
```

### §4.3a Bug #7 integration: `integrations/claude-code/kebab/SKILL.md:57` simultaneous update

Per the CLAUDE.md §Wire schema v1 invariant, in-tree integration docs must be synchronized when the wire surface changes.
This round has no wire schema change, but SKILL.md line 57 has the same omission as §4.3 (Bug #7):

**File**: `integrations/claude-code/kebab/SKILL.md` (line 57)

**Change**:

```diff
-`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"other"`; alias `"md"` → `"markdown"`)
+`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"code"` | `"other"`; alias `"md"` → `"markdown"`)
```

### §4.4 Bug #7: CLI help assertion

Add a test to `crates/kebab-cli/tests/` (or extend the existing help snapshot test):

```rust
#[test]
fn search_help_lists_code_in_media_values() {
    // `CARGO_BIN_EXE_kebab` is set by Cargo for integration tests and
    // points at the freshly built `kebab` binary.
    let out = std::process::Command::new(env!("CARGO_BIN_EXE_kebab"))
        .args(["search", "--help"])
        .output()
        .expect("kebab search --help");
    let stdout = String::from_utf8_lossy(&out.stdout);
    // The doc-comment is rendered verbatim, so the backticks survive.
    assert!(
        stdout.contains("`code`"),
        "search --help must list 'code' as accepted --media value"
    );
}
```

### §4.5 Version cascade impact (CLAUDE.md §Versioning cascade)

- **parser_version**: `"pdf-text-v1"`, unchanged. The text-detect threshold is an internal metric, not surface.
- **chunker_version**: `"pdf-page-v1.1"`, unchanged (no chunker logic affected).
- **wire schema**: No new fields, no schema version bump. `compute_valid_char_ratio()` is internal to `PdfTextExtractor::extract()`.

---

## §5 Acceptance criteria

- [ ] Text_quality unit test: `identity_h_marker_dominance_caps_ratio_below_threshold` passes.
- [ ] Text_quality unit test: `identity_h_marker_minority_with_long_valid_text_keeps_high_ratio` passes.
- [ ] Regression: All 8 existing text_quality tests remain green (no ratio behavior changes for valid text).
- [ ] CLI help assertion: `cargo test search_help_lists_code_in_media_values` passes.
- [ ] SKILL.md integration: `grep -F '"code"' integrations/claude-code/kebab/SKILL.md` returns ≥1 line.
- [ ] Full workspace test suite: `cargo test --workspace --no-fail-fast -j 1` green (unit + integration; clippy is a separate `cargo clippy` run).
- [ ] Fresh binary builds: `CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build --release -p kebab-cli` succeeds.

---

## §6 Risks + open questions

### Identified risks

**R-1: Other lopdf unimplemented markers** (resolved per critic round 1 probe): lopdf 0.32.0 emits exactly one marker, `?Identity-H Unimplemented?`, at `lopdf-0.32.0/src/document.rs:523` (`Document::decode_text`). Other CMap encoding arms (`UniCNS`, `UniJIS`, `UniKS`, `GBK-EUC`, `Adobe-*`) fall through to `String::from_utf8_lossy(bytes)` → PUA / replacement-char territory (already covered by the `pure_pua_zero` test). The marker array is adequate for the current lopdf pin. **Re-verify on lopdf dependency upgrade.**

**R-2: Whitespace-only edge case after stripping**: Handled by the `.trim().is_empty()` check; returns 0.0 as intended.

**R-3: Version/wire schema impact**: None. text_quality is an internal threshold metric, not exposed to the wire schema or version cascade.

**R-4: Other `--media` help locations** (revised per critic): The `--media` value list is scattered across 3 surfaces: `crates/kebab-cli/src/main.rs:157–159` (CLI doc-comment, covered by §4.3), `integrations/claude-code/kebab/SKILL.md:57` (skill doc, covered by §4.3a), and `crates/kebab-cli/tests/cli_help_smoke.rs` (test, covered by §4.4). Plan drafter to run `grep -rn '\bmedia\b' integrations/ crates/kebab-cli/src docs/wire-schema/v1` to confirm no additional surfaces exist (the `-r` flag is required because the arguments are directories).

**R-5: Bulk mode media field parsing**: `crates/kebab-app/src/bulk.rs:161` handles media field parsing independently; the doc-comment update does not affect functional correctness.

### Open questions

**OQ-1: Marker case sensitivity**: Does lopdf always emit the marker in the exact case `?Identity-H Unimplemented?`? Verify against the lopdf source. If case variations exist, use case-insensitive matching or extend the array.
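
Should OQ-1 turn up case variants, case-insensitive stripping could look roughly like the sketch below. It is an illustration only: `strip_marker_ascii_ci` is a hypothetical helper that does not exist in `text_quality.rs`, and it assumes the marker is pure ASCII (true for the one marker listed in §3.1).

```rust
/// Hypothetical helper: remove every ASCII-case-insensitive occurrence of
/// `marker` from `s`. Assumes `marker` is pure ASCII.
fn strip_marker_ascii_ci(s: &str, marker: &str) -> String {
    debug_assert!(marker.is_ascii());
    if marker.is_empty() {
        return s.to_string();
    }
    // ASCII lowercasing never changes byte lengths, so byte offsets found
    // in `hay` are valid char boundaries in `s` as well.
    let hay = s.to_ascii_lowercase();
    let needle = marker.to_ascii_lowercase();

    let mut out = String::with_capacity(s.len());
    let mut pos = 0;
    while let Some(off) = hay[pos..].find(&needle) {
        let start = pos + off;
        out.push_str(&s[pos..start]); // keep text before the match
        pos = start + needle.len(); // skip the matched marker
    }
    out.push_str(&s[pos..]); // keep the tail
    out
}

fn main() {
    let marker = "?Identity-H Unimplemented?";
    let page = "Header ?identity-h UNIMPLEMENTED? 가나다 ?Identity-H Unimplemented?";
    let cleaned = strip_marker_ascii_ci(page, marker);
    assert_eq!(cleaned, "Header  가나다 ");
    println!("{cleaned:?}");
}
```

Exact-case matching via `str::replace` stays simpler and is sufficient as long as the lopdf source confirms a single fixed literal.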

**OQ-2: Marker stripping threshold policy** (resolved via the §4.1 marker-dominance heuristic): When stripped characters exceed remaining characters, the ratio is capped at 0.3 to trigger OCR fallback. This ensures marker-dominant mixed pages (e.g., 99% marker + 1% valid header) do not bypass OCR despite the header's high ratio. Design intent (parent spec §1.3) is upheld for marker-only and marker-dominant pages; marker-minority pages keep their computed ratio so otherwise-good pages are not sent to OCR.

**OQ-3: Alias expansion scope**: Bug #7 explicitly omits new aliases (e.g., `src` → `code`). Single additive fix to the doc-comment, no enum variant changes.

### UX consequence: pre-bugfix2 v0.20 user's `--force` re-ingest

This round preserves the version cascade (no `parser_version` bump). The `try_skip_unchanged` path will match files indexed pre-bugfix2 with the same `parser_version="pdf-text-v1"` + hash. Pre-indexed `metro-korea.pdf`-class pages will NOT automatically re-route through the corrected text-detect → OCR fallback.

**User action required**: Explicit `kebab ingest --force-reingest ` to purge cached skip decisions and re-process affected files.

**Release notes** (v0.20.1 or whichever version ships this bugfix) **MUST include**: "If you indexed mojibake-heavy PDFs (esp. metro-korea.pdf class) on v0.20.0 pre-bugfix2, run `kebab ingest --force-reingest ` to apply the improved text detection. Otherwise, `ingest` will skip unchanged files and OCR fallback will not trigger." + link to design §9 cascade explanation.

**Documentation updates** (same PR as code): README + HANDOFF + ARCHITECTURE per the `feedback_readme_sync_rule` memory; mention the `--force-reingest` step in release highlights or migration notes.

Deliberate design: automatic migration risks wedging stable v0.20.0 KBs. Manual `--force-reingest` is the correct escape hatch (parent spec §1.7 line 126–128 precedent).
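
The reasoning behind the manual step can be sketched as follows. This is a hypothetical model, not the actual `try_skip_unchanged` implementation: `IndexedRecord` and `would_skip` are invented names, and the only assumption taken from this spec is that the skip decision keys on content hash plus `parser_version`.

```rust
/// Hypothetical record of what was stored for a file at index time.
#[derive(Debug, Clone, PartialEq, Eq)]
struct IndexedRecord {
    content_hash: String,
    parser_version: String,
}

/// Assumed skip rule: skip when nothing in the key changed and the user
/// did not force a re-ingest.
fn would_skip(previous: Option<&IndexedRecord>, current: &IndexedRecord, force: bool) -> bool {
    !force && previous == Some(current)
}

fn main() {
    // Indexed on v0.20.0 before this bugfix.
    let before = IndexedRecord {
        content_hash: "abc123".to_string(), // placeholder hash
        parser_version: "pdf-text-v1".to_string(),
    };
    // Same file seen by the fixed binary: neither key field changed.
    let after = before.clone();

    // Plain ingest skips the file, so the corrected ratio never runs.
    assert!(would_skip(Some(&before), &after, false));
    // A forced re-ingest bypasses the skip.
    assert!(!would_skip(Some(&before), &after, true));

    // A parser_version bump would also have invalidated the skip,
    // which is exactly what this round chooses not to do.
    let bumped = IndexedRecord {
        parser_version: "pdf-text-v2".to_string(), // illustrative value only
        ..before.clone()
    };
    assert!(!would_skip(Some(&before), &bumped, false));
}
```

Under this model the fix is invisible to the skip key, which is why the release notes have to carry the re-ingest instruction.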

---

## §7 References

- **Parent spec**: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md §1.3 (line 74), §1.4, §9
- **Dogfood evidence**: .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md §Bug #6, §Bug #7
- **Critic result**: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md (findings H-1 through NIT-2, parent invariant audit)
- **External source**: lopdf-0.32.0/src/document.rs:523 (`Document::decode_text`, sole emitter of the `?Identity-H Unimplemented?` marker)
- **Code locations**:
  - text_quality.rs: `crates/kebab-parse-pdf/src/text_quality.rs:9-106`
  - CLI help: `crates/kebab-cli/src/main.rs:157–159`
  - Skill integration: `integrations/claude-code/kebab/SKILL.md:57`
  - CLI test: `crates/kebab-cli/tests/` (search_help_lists_code_in_media_values)

---

**Status**: Round 1c rewrite COMPLETE. All 9 critic findings (H-1 + M-1/M-2/M-3 + L-1/L-2 + NIT-1/NIT-2 + invariant audit) applied in-session.

**Prior round reference**: Round 1 commits (d9acda5, 436fd01, 241ded5, e674ff4) are merged on branch; this round is independent (text_quality.rs vs. source-fs/connector.rs + chunk/pdf_page_v1.rs).