kebab/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md

---
title: "v0.20.0 sub-item 1 bugfix round 2 — Identity-H mojibake marker + CLI --media help text"
created: 2026-05-27
status: "DRAFT round 1c"
parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
contract_sections: ["§1.3 (text-detect threshold metric)", "§9 (version cascade)"]
related_specs:
  - docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
  - docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
related_dogfood:
  - .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md (Bug #6 + #7)
---

# v0.20.0 sub-item 1 bugfix round 2 — Identity-H mojibake marker + CLI --media help text

## §1 Problem statement

### §1.1 Bug #6: Identity-H Unimplemented marker bypasses mojibake detection

**Symptom**: `metro-korea.pdf` (58 MB, Identity-H CID font without ToUnicode CMap) ingests with `pdf_ocr_pages=0`. Full text contains `?Identity-H Unimplemented?` marker 1154 times. All 21 pages + 34 chunks are indexable, but content is unusable garbage — repeated marker literal instead of readable text.

**Root cause**: `crates/kebab-parse-pdf/src/text_quality.rs` lines 9-37. The function `compute_valid_char_ratio()` via `is_valid_text_char()` treats ASCII printable range `0x0020..=0x007E` as unconditionally valid. lopdf emits `?Identity-H Unimplemented?` (28 ASCII printable chars) when it cannot decode a CID font lacking ToUnicode CMap. Result: valid_ratio = 1.0 → exceeds OCR fallback threshold 0.5 → text-detect first-pass incorrectly classifies mojibake as valid text → `pdf_ocr_pages` stays 0, no OCR fallback triggered.

**Design intent deviation**: Parent spec §1.3 (line 74) explicitly states "ratio metric judges mojibake page as scanned candidate." PoC example "֥ᬵᯝ₞e ࠦᯱᖝ░" (custom font, no ToUnicode) should trigger OCR. **Implementation gap**: literal ASCII marker case (Identity-H font) was not anticipated.

### §1.2 Bug #7: CLI `--media` help text omits `code` from valid values

**Symptom**: `kebab search --help` lists `--media` accepted values as "markdown, pdf, image, audio, other" — `code` is missing.

**Actual behavior**: `kebab search "main" --media code --json -k 5` returns 5 hits (code/script.sh, code/rust_sample/src/main.rs, etc.). Schema `media_breakdown` includes `code: 6` as first-class. Functional correctness is complete; **CLI doc-comment is outdated only**.

**Root cause**: `crates/kebab-cli/src/main.rs:148-165`. SearchArgs `--media` field clap doc-comment omits `code`. clap's `--help` renderer quotes this doc-comment directly.

---

## §2 Scope + non-scope

### §2.1 Included in this spec

- **Bug #6 fix**: Add known mojibake marker stripping to `compute_valid_char_ratio()`.
- **Bug #6 test**: Three new unit tests covering Identity-H / Identity-V markers (full-text, mixed-text cases).
- **Bug #6 regression**: Verify existing 8 text_quality unit tests remain green.
- **Bug #7 fix**: Update CLI `--media` doc-comment to include `code`.
- **Bug #7 test**: Assert that `kebab search --help` output contains "`code`" substring.
- **Traceability**: Link both fixes to parent spec §1.3 design intent.

### §2.2 Explicitly out of scope

**Bug #8 candidate (falsified)**: V007 trigram tokenizer already applied; 2-character query limitation is design-level constraint, not a bug. Handled in prior dogfood report §Bug #8.

**Non-bug observations**:
- `--readonly + ingest` exit=0: Graceful refusal per CLAUDE.md intent (exit codes 0/1/2/3 unchanged; `error.v1.code` handles agent branching).
- Ask phrasing-sensitive refusal: RAG corner case; not a code defect.
- Binary staleness: Environmental artifact, not applicable to spec.

**Ancillary risks**:
- Scan for other `--media` doc-comment locations (R-4): Plan drafter to use grep; not blockers for this spec.
- Other lopdf unimplemented markers (R-1): Plan drafter to inspect lopdf source; marker array is extensible.

---

## §3 Decisions

### §3.1 Bug #6: Known mojibake marker stripping

Strip known mojibake marker substrings **before ratio calculation**, then force ratio to 0.0 if remainder is empty after marker removal. When stripped characters exceed remaining characters (marker dominance), cap ratio at 0.3 to trigger OCR fallback on marker-heavy mixed pages.

**Rationale**: lopdf's unimplemented CID font handling consistently emits specific ASCII marker strings. Hardcoding them is lightweight, deterministic, and covers the known failure mode without requiring expensive heuristics (e.g., ML-based gibberish detection). Pages like `metro-korea.pdf` may contain mostly mojibake body text with small valid headers; the marker-dominance check ensures such pages fall below the 0.5 OCR threshold.

**Marker list**: `?Identity-H Unimplemented?` only. lopdf 0.32.0 emits exactly one marker (verified per critic round 1 probe). Extensible if future lopdf versions emit additional markers.

### §3.2 Bug #7: CLI doc-comment update

Add `code` to the comma-separated list of valid `--media` values in the SearchArgs field's clap doc-comment. Single-line edit; no functional or schema changes.

### §3.3 Parent spec traceability

Both fixes uphold parent spec §1.3:
- Bug #6 ensures mojibake pages (Text CMap-missing fonts) trigger OCR fallback per design intent.
- Bug #7 corrects CLI documentation to match actual schema (first-class `code` media type supported since v0.18.0).

No changes to parser_version, chunker_version, or wire schema.

---

## §4 Implementation specification

### §4.1 Bug #6: text_quality.rs diff

**File**: `crates/kebab-parse-pdf/src/text_quality.rs`

**Change**:
1. Add constant array of known mojibake markers (lines 8–10):
   ```rust
   // Source of truth: lopdf-0.32.0/src/document.rs:523 (Document::decode_text).
   // Only one Unimplemented marker is emitted by lopdf 0.32.0; other CMap
   // encodings fall through to `String::from_utf8_lossy(bytes)`, which yields
   // PUA / replacement-char territory already covered by `pure_pua_zero`.
   // Re-verify on lopdf dependency upgrade.
   const MOJIBAKE_MARKERS: &[&str] = &[
       "?Identity-H Unimplemented?",
   ];
   ```

2. Refactor `compute_valid_char_ratio()` (lines 39–106):
   ```rust
   pub fn compute_valid_char_ratio(s: &str) -> f32 {
       // 1) Strip known mojibake markers before counting valid chars.
       //    Identity-H CID fonts without ToUnicode CMap emit ASCII-only marker
       //    substrings (bypassing PUA detection).
       let mut cleaned: String = s.to_string();
       // `had_marker` guard preserves prior behavior for whitespace-only input
       // (returns ratio of whitespace validity, not 0.0) when no markers found.
       // With markers stripped, the guard enables the trim-empty check.
       let mut had_marker = false;
       for marker in MOJIBAKE_MARKERS {
           if cleaned.contains(marker) {
               had_marker = true;
               cleaned = cleaned.replace(marker, "");
           }
       }
       // 2) Whitespace-only cleaned text → 0.0 (marker-only page).
       if had_marker && cleaned.trim().is_empty() {
           return 0.0;
       }
       // 3) Marker-dominance heuristic — when stripped chars exceed remaining
       //    chars (i.e. marker > 50% of original), the page is "mostly mojibake
       //    with some decodeable page-furniture" (e.g. metro-korea.pdf has
       //    header text in a separate font + body that is Identity-H CID).
       //    Force ratio downward to trigger OCR fallback (parent spec §1.3 intent).
       if had_marker {
           let stripped_chars = s.len().saturating_sub(cleaned.len());
           if stripped_chars > cleaned.len() {
               // Marker dominates — cap ratio at 0.3 (below 0.5 OCR threshold).
               // The 0.3 cap (not 0.0) preserves a small signal that some text
               // WAS decodeable, useful for downstream metrics if ever exposed.
               let mut total = 0u32;
               let mut valid = 0u32;
               for c in cleaned.chars() {
                   total += 1;
                   if is_valid_text_char(c) {
                       valid += 1;
                   }
               }
               let raw_ratio = if total == 0 { 0.0 } else { valid as f32 / total as f32 };
               return raw_ratio.min(0.3);
           }
       }
       // 4) Otherwise compute ratio on cleaned text (existing logic).
       let mut total = 0u32;
       let mut valid = 0u32;
       for c in cleaned.chars() {
           total += 1;
           if is_valid_text_char(c) {
               valid += 1;
           }
       }
       if total == 0 {
           return 0.0;
       }
       valid as f32 / total as f32
   }
   ```

**Invariants preserved**:
- Function signature and return type unchanged (→ byte-identical caller surface).
- Existing character category logic (hangul, CJK, Latin-1) unmodified.
- Empty-string behavior (return 0.0) preserved.

### §4.2 Bug #6: Unit tests

Replace existing Bug #6 test set with two new tests reflecting marker-dominance heuristic:

```rust
#[test]
fn identity_h_marker_dominance_caps_ratio_below_threshold() {
    // metro-korea.pdf-class: 20× marker (560 char) + 11 char ASCII header.
    // Without dominance heuristic: ratio = 11/11 = 1.0 (bypasses OCR).
    // With dominance heuristic: ratio ≤ 0.3 (triggers OCR fallback).
    let s = format!("Page 1 of 5 {}", "?Identity-H Unimplemented?".repeat(20));
    let r = compute_valid_char_ratio(&s);
    assert!(r <= 0.3, "marker-dominant mixed page → ratio ≤ 0.3 (OCR fallback); got {r}");
}

#[test]
fn identity_h_marker_minority_with_long_valid_text_keeps_high_ratio() {
    // Inverse case: short marker noise + long valid text → ratio stays high
    // (no false OCR trigger on otherwise-good pages).
    let header = "x".repeat(200);  // 200 char valid ASCII
    let s = format!("{header} ?Identity-H Unimplemented?");  // 1× marker = 26 char
    let r = compute_valid_char_ratio(&s);
    assert!(r > 0.9, "marker-minority page keeps high ratio; got {r}");
}
```

**Regression preservation**: Existing 8 tests (`empty_string_zero`, `pure_ascii_one`, `pure_hangul_syllables_one`, `pure_pua_zero`, `mixed_half`, `cjk_ideograph_valid`, `hangul_jamo_valid`, `f4_fixture_ratio_under_threshold`) must all remain green.

### §4.3 Bug #7: CLI doc-comment diff

**File**: `crates/kebab-cli/src/main.rs` (SearchArgs field, lines ~150–160)

**Change**:
```diff
-/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `other`. Unknown values match nothing
+/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `code`, `other`. Unknown values match nothing
```

### §4.3a Bug #7 integration: `integrations/claude-code/kebab/SKILL.md:57` simultaneous update

Per CLAUDE.md §Wire schema v1 invariant — in-tree integration docs must be synchronized when wire surface changes. This round has no wire schema change, but SKILL.md line 57 exhibits the same regression as §4.3 (Bug #7):

**File**: `integrations/claude-code/kebab/SKILL.md` (line 57)

**Change**:
```diff
-`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"other"`; alias `"md"` → `"markdown"`)
+`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"code"` | `"other"`; alias `"md"` → `"markdown"`)
```

### §4.4 Bug #7: CLI help assertion

Add test to `crates/kebab-cli/tests/` (or extend existing help snapshot test):

```rust
#[test]
fn search_help_lists_code_in_media_values() {
    let out = std::process::Command::new(env!("CARGO_BIN_EXE_kebab"))
        .args(["search", "--help"])
        .output()
        .expect("kebab search --help");
    let stdout = String::from_utf8_lossy(&out.stdout);
    assert!(stdout.contains("`code`"), "search --help must list 'code' as accepted --media value");
}
```

### §4.5 Version cascade impact (CLAUDE.md §Versioning cascade)

- **parser_version**: `"pdf-text-v1"` — unchanged. Text-detect threshold is internal metric, not surface.
- **chunker_version**: `"pdf-page-v1.1"` — unchanged (no chunker logic affected).
- **wire schema**: No new fields, no schema version bump. `compute_valid_char_ratio()` is internal to `PdfTextExtractor::extract()`.

---

## §5 Acceptance criteria

- [ ] Text_quality unit test: `identity_h_marker_dominance_caps_ratio_below_threshold` passes.
- [ ] Text_quality unit test: `identity_h_marker_minority_with_long_valid_text_keeps_high_ratio` passes.
- [ ] Regression: All 8 existing text_quality tests remain green (no ratio behavior changes for valid text).
- [ ] CLI help assertion: `cargo test search_help_lists_code_in_media_values` passes.
- [ ] SKILL.md integration: `grep -F '"code"' integrations/claude-code/kebab/SKILL.md` returns ≥1 line.
- [ ] Full workspace test suite: `cargo test --workspace --no-fail-fast -j 1` green (clippy + unit + integration).
- [ ] Fresh binary builds: `CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build --release -p kebab-cli` succeeds.

---

## §6 Risks + open questions

### Identified risks

**R-1 — Other lopdf unimplemented markers** (resolved per critic round 1 probe): lopdf 0.32.0 emits exactly one marker — `?Identity-H Unimplemented?` at `lopdf-0.32.0/src/document.rs:523` (`Document::decode_text`). Other CMap encoding arms (`UniCNS`, `UniJIS`, `UniKS`, `GBK-EUC`, `Adobe-*`) fall through to `String::from_utf8_lossy(bytes)` → PUA / replacement-char territory (already covered by `pure_pua_zero` test). Marker array adequacy = OK for current lopdf pin. **Re-verify on lopdf dependency upgrade.**

**R-2 — Whitespace-only edge case after stripping**: Handled by `.trim().is_empty()` check; returns 0.0 as intended.

**R-3 — Version/wire schema impact**: None. text_quality is internal threshold metric, not exposed to wire schema or version cascade.

**R-4 — Other `--media` help locations** (revised per critic): `--media` value list is scattered across 3 surfaces — `crates/kebab-cli/src/main.rs:157–159` (CLI doc-comment, covered by §4.3), `integrations/claude-code/kebab/SKILL.md:57` (skill doc, covered by §4.3a), `crates/kebab-cli/tests/cli_help_smoke.rs` (test, covered by §4.4). Plan drafter to run `grep '\bmedia\b' integrations/ crates/kebab-cli/src docs/wire-schema/v1` to confirm no additional surfaces exist.

**R-5 — Bulk mode media field parsing**: `crates/kebab-app/src/bulk.rs:161` handles media field parsing independently; string doc-comment update does not affect functional correctness.

### Open questions

**OQ-1 — Marker case sensitivity**: Does lopdf always emit markers in exact case `?Identity-H Unimplemented?`? Verify with lopdf source. If case variations exist, use case-insensitive matching or extend array.

**OQ-2 — Marker stripping threshold policy** (resolved via §4.1 marker-dominance heuristic): When stripped characters exceed remaining characters, ratio is capped at 0.3 to trigger OCR fallback. This ensures marker-dominant mixed pages (e.g., 99% marker + 1% valid header) do not bypass OCR despite the header's high ratio. Design intent (parent spec §1.3) is upheld: all mojibake pages trigger OCR fallback.

**OQ-3 — Alias expansion scope**: Bug #7 explicitly omits new aliases (e.g., `src` → `code`). Single additive fix to doc-comment, no enum variant changes.

### UX consequence — pre-bugfix2 v0.20 user's `--force` re-ingest

This round preserves version cascade (no `parser_version` bump). The `try_skip_unchanged` path will match files indexed pre-bugfix2 with same `parser_version="pdf-text-v1"` + hash. Pre-indexed `metro-korea.pdf`-class pages will NOT automatically re-route through the corrected text-detect → OCR fallback.

**User action required**: Explicit `kebab ingest --force-reingest <workspace>` to purge cached skip decisions and re-process affected files.

**Release notes** (v0.20.1 or whichever version ships this bugfix) **MUST include**: "If you indexed mojibake-heavy PDFs (esp. metro-korea.pdf class) on v0.20.0 pre-bugfix2, run `kebab ingest --force-reingest <workspace>` to apply the improved text detection. Otherwise, `ingest` will skip unchanged files and OCR fallback will not trigger." + link to design §9 cascade explanation.

**Documentation updates** (same PR as code): README + HANDOFF + ARCHITECTURE per `feedback_readme_sync_rule` memory — mention the `--force-reingest` step in release highlights or migration notes.

Deliberate design: automatic migration risks wedging stable v0.20.0 KBs. Manual `--force-reingest` is the correct escape hatch (parent spec §1.7 line 126–128 precedent).

---

## §7 References

- **Parent spec**: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md §1.3 (line 74), §1.4, §9
- **Dogfood evidence**: .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md §Bug #6, §Bug #7
- **Critic result**: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md (findings H-1 through NIT-2, parent invariant audit)
- **External source**: lopdf-0.32.0/src/document.rs:523 (`Document::decode_text` — sole emitter of `?Identity-H Unimplemented?` marker)
- **Code locations**:
  - text_quality.rs: `crates/kebab-parse-pdf/src/text_quality.rs:9-106`
  - CLI help: `crates/kebab-cli/src/main.rs:157–159`
  - Skill integration: `integrations/claude-code/kebab/SKILL.md:57`
  - CLI test: `crates/kebab-cli/tests/` (search_help_lists_code_in_media_values)

---

**Status**: Round 1c rewrite COMPLETE. All 9 critic findings (H-1 + M-1/M-2/M-3 + L-1/L-2 + NIT-1/NIT-2 + invariant audit) applied in-session.

**Prior round reference**: Round 1 commits (d9acda5, 436fd01, 241ded5, e674ff4) are merged on branch; this round is independent (text_quality.rs vs. source-fs/connector.rs + chunk/pdf_page_v1.rs).