kebab/docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
altair823 46e99470eb docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md
Deliverables of the 3-round dogfood-driven fix cycle:

- bugfix1 (Bug #2/#3/#4): spec 964 line + plan 848 line
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 line + plan 388 line
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 line + plan 1043 line
- docs/DOGFOOD.md: the complete all-around dogfood checklist (§0 environment through §13 reference corpus)

Each round's spec/plan was frozen after critic + verifier round 2 closure ACCEPT. Based on dogfood-driven evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:21:34 +00:00

---
title: "v0.20.0 sub-item 1 bugfix round 2 — Identity-H mojibake marker + CLI --media help text"
created: 2026-05-27
status: "DRAFT round 1c"
parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
contract_sections: ["§1.3 (text-detect threshold metric)", "§9 (version cascade)"]
related_specs:
- docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
- docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
related_dogfood:
- .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md (Bug #6 + #7)
---
# v0.20.0 sub-item 1 bugfix round 2 — Identity-H mojibake marker + CLI --media help text
## §1 Problem statement
### §1.1 Bug #6: Identity-H Unimplemented marker bypasses mojibake detection
**Symptom**: `metro-korea.pdf` (58 MB, Identity-H CID font without ToUnicode CMap) ingests with `pdf_ocr_pages=0`. Full text contains `?Identity-H Unimplemented?` marker 1154 times. All 21 pages + 34 chunks are indexable, but content is unusable garbage — repeated marker literal instead of readable text.
**Root cause**: `crates/kebab-parse-pdf/src/text_quality.rs` lines 9-37. `compute_valid_char_ratio()`, via `is_valid_text_char()`, treats the ASCII printable range `0x0020..=0x007E` as unconditionally valid. lopdf emits `?Identity-H Unimplemented?` (26 ASCII printable chars) when it cannot decode a CID font lacking a ToUnicode CMap. Result: valid_ratio = 1.0 → exceeds the OCR fallback threshold of 0.5 → the text-detect first pass incorrectly classifies mojibake as valid text → `pdf_ocr_pages` stays 0 and no OCR fallback is triggered.
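The bypass can be reproduced in isolation. The following is a minimal sketch, not the crate's code: `is_valid_text_char_sketch` and `naive_valid_ratio` are hypothetical names, and the predicate models only the ASCII printable rule described above (the real predicate also accepts Hangul, CJK, and Latin-1).

```rust
// Assumption: simplified stand-in for the real predicate. Only the ASCII
// printable rule from the root-cause description is modeled here.
fn is_valid_text_char_sketch(c: char) -> bool {
    ('\u{0020}'..='\u{007E}').contains(&c)
}

// Pre-fix style ratio: valid chars / total chars, with no marker handling.
fn naive_valid_ratio(s: &str) -> f32 {
    let total = s.chars().count();
    if total == 0 {
        return 0.0;
    }
    let valid = s.chars().filter(|&c| is_valid_text_char_sketch(c)).count();
    valid as f32 / total as f32
}
```

Under these assumptions, `naive_valid_ratio("?Identity-H Unimplemented?")` evaluates to 1.0, so a page made only of markers looks like perfectly valid text, while a string of Private Use Area code points scores 0.0.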
**Design intent deviation**: Parent spec §1.3 (line 74) explicitly states "ratio metric judges mojibake page as scanned candidate." PoC example "֥ᬵᯝ₞e ࠦᯱᖝ░" (custom font, no ToUnicode) should trigger OCR. **Implementation gap**: literal ASCII marker case (Identity-H font) was not anticipated.
### §1.2 Bug #7: CLI `--media` help text omits `code` from valid values
**Symptom**: `kebab search --help` lists `--media` accepted values as "markdown, pdf, image, audio, other" — `code` is missing.
**Actual behavior**: `kebab search "main" --media code --json -k 5` returns 5 hits (code/script.sh, code/rust_sample/src/main.rs, etc.). Schema `media_breakdown` includes `code: 6` as first-class. Functional correctness is complete; **CLI doc-comment is outdated only**.
**Root cause**: `crates/kebab-cli/src/main.rs:148-165`. SearchArgs `--media` field clap doc-comment omits `code`. clap's `--help` renderer quotes this doc-comment directly.
---
## §2 Scope + non-scope
### §2.1 Included in this spec
- **Bug #6 fix**: Add known mojibake marker stripping to `compute_valid_char_ratio()`.
- **Bug #6 test**: Two new unit tests covering the Identity-H marker in mixed-text pages (marker-dominant and marker-minority cases).
- **Bug #6 regression**: Verify existing 8 text_quality unit tests remain green.
- **Bug #7 fix**: Update CLI `--media` doc-comment to include `code`.
- **Bug #7 test**: Assert that `kebab search --help` output contains "`code`" substring.
- **Traceability**: Link both fixes to parent spec §1.3 design intent.
### §2.2 Explicitly out of scope
**Bug #8 candidate (falsified)**: V007 trigram tokenizer already applied; 2-character query limitation is design-level constraint, not a bug. Handled in prior dogfood report §Bug #8.
**Non-bug observations**:
- `--readonly + ingest` exit=0: Graceful refusal per CLAUDE.md intent (exit codes 0/1/2/3 unchanged; `error.v1.code` handles agent branching).
- Ask phrasing-sensitive refusal: RAG corner case; not a code defect.
- Binary staleness: Environmental artifact, not applicable to spec.
**Ancillary risks**:
- Scan for other `--media` doc-comment locations (R-4): Plan drafter to use grep; not blockers for this spec.
- Other lopdf unimplemented markers (R-1): Plan drafter to inspect lopdf source; marker array is extensible.
---
## §3 Decisions
### §3.1 Bug #6: Known mojibake marker stripping
Strip known mojibake marker substrings **before ratio calculation**, then force ratio to 0.0 if remainder is empty after marker removal. When stripped characters exceed remaining characters (marker dominance), cap ratio at 0.3 to trigger OCR fallback on marker-heavy mixed pages.
**Rationale**: lopdf's unimplemented CID font handling consistently emits specific ASCII marker strings. Hardcoding them is lightweight, deterministic, and covers the known failure mode without requiring expensive heuristics (e.g., ML-based gibberish detection). Pages like `metro-korea.pdf` may contain mostly mojibake body text with small valid headers; the marker-dominance check ensures such pages fall below the 0.5 OCR threshold.
**Marker list**: `?Identity-H Unimplemented?` only. lopdf 0.32.0 emits exactly one marker (verified per critic round 1 probe). Extensible if future lopdf versions emit additional markers.
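The dominance rule comes down to comparing two lengths. A minimal sketch of that arithmetic follows, assuming byte lengths as the unit; `strip_stats` and `marker_dominates` are hypothetical helper names used only for illustration.

```rust
const MARKER: &str = "?Identity-H Unimplemented?";

/// Returns (stripped_bytes, remaining_bytes) after removing every marker.
fn strip_stats(s: &str) -> (usize, usize) {
    let cleaned = s.replace(MARKER, "");
    (s.len() - cleaned.len(), cleaned.len())
}

/// Dominance rule from the decision above: the stripped length exceeds
/// the length of whatever text remains.
fn marker_dominates(s: &str) -> bool {
    let (stripped, remaining) = strip_stats(s);
    stripped > remaining
}
```

For a page like `"Page 1 of 5 "` followed by 20 markers, the stripped side is 520 bytes against 12 remaining, so the cap applies. For 200 valid characters, a space, and one marker, the stripped side is 26 against 201, so the ratio is left alone.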
### §3.2 Bug #7: CLI doc-comment update
Add `code` to the comma-separated list of valid `--media` values in the SearchArgs field's clap doc-comment. Single-line edit; no functional or schema changes.
### §3.3 Parent spec traceability
Both fixes uphold parent spec §1.3:
- Bug #6 ensures mojibake pages (text from fonts missing a ToUnicode CMap) trigger OCR fallback per design intent.
- Bug #7 corrects CLI documentation to match actual schema (first-class `code` media type supported since v0.18.0).
No changes to parser_version, chunker_version, or wire schema.
---
## §4 Implementation specification
### §4.1 Bug #6: text_quality.rs diff
**File**: `crates/kebab-parse-pdf/src/text_quality.rs`
**Change**:
1. Add constant array of known mojibake markers (lines 8-10):
```rust
// Source of truth: lopdf-0.32.0/src/document.rs:523 (Document::decode_text).
// Only one Unimplemented marker is emitted by lopdf 0.32.0; other CMap
// encodings fall through to `String::from_utf8_lossy(bytes)`, which yields
// PUA / replacement-char territory already covered by `pure_pua_zero`.
// Re-verify on lopdf dependency upgrade.
const MOJIBAKE_MARKERS: &[&str] = &[
    "?Identity-H Unimplemented?",
];
```
2. Refactor `compute_valid_char_ratio()` (lines 39-106):
```rust
pub fn compute_valid_char_ratio(s: &str) -> f32 {
    // 1) Strip known mojibake markers before counting valid chars.
    //    Identity-H CID fonts without ToUnicode CMap emit ASCII-only marker
    //    substrings (bypassing PUA detection).
    let mut cleaned: String = s.to_string();
    // `had_marker` guard preserves prior behavior for whitespace-only input
    // (returns ratio of whitespace validity, not 0.0) when no markers found.
    // With markers stripped, the guard enables the trim-empty check.
    let mut had_marker = false;
    for marker in MOJIBAKE_MARKERS {
        if cleaned.contains(marker) {
            had_marker = true;
            cleaned = cleaned.replace(marker, "");
        }
    }
    // 2) Whitespace-only cleaned text → 0.0 (marker-only page).
    if had_marker && cleaned.trim().is_empty() {
        return 0.0;
    }
    // 3) Marker-dominance heuristic: when stripped chars exceed remaining
    //    chars (i.e. marker > 50% of original), the page is "mostly mojibake
    //    with some decodeable page-furniture" (e.g. metro-korea.pdf has
    //    header text in a separate font + body that is Identity-H CID).
    //    Force ratio downward to trigger OCR fallback (parent spec §1.3 intent).
    if had_marker {
        // Note: `len()` is a byte length. The marker is pure ASCII, so the
        // stripped side equals its char count; multi-byte remaining text
        // (e.g. Hangul) weighs more than its char count on the other side.
        let stripped_chars = s.len().saturating_sub(cleaned.len());
        if stripped_chars > cleaned.len() {
            // Marker dominates: cap ratio at 0.3 (below 0.5 OCR threshold).
            // The 0.3 cap (not 0.0) preserves a small signal that some text
            // WAS decodeable, useful for downstream metrics if ever exposed.
            let mut total = 0u32;
            let mut valid = 0u32;
            for c in cleaned.chars() {
                total += 1;
                if is_valid_text_char(c) {
                    valid += 1;
                }
            }
            let raw_ratio = if total == 0 { 0.0 } else { valid as f32 / total as f32 };
            return raw_ratio.min(0.3);
        }
    }
    // 4) Otherwise compute ratio on cleaned text (existing logic).
    let mut total = 0u32;
    let mut valid = 0u32;
    for c in cleaned.chars() {
        total += 1;
        if is_valid_text_char(c) {
            valid += 1;
        }
    }
    if total == 0 {
        return 0.0;
    }
    valid as f32 / total as f32
}
```
**Invariants preserved**:
- Function signature and return type unchanged (→ byte-identical caller surface).
- Existing character category logic (hangul, CJK, Latin-1) unmodified.
- Empty-string behavior (return 0.0) preserved.
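To check the edge cases by hand (marker-only input, marker plus whitespace, empty input), the branches can be condensed around one counting helper. This is a sketch under stated assumptions, not the shipped function: `compute_valid_char_ratio_sketch` and `valid_ratio` are hypothetical names, and the predicate is a simplified ASCII-only stand-in for the real `is_valid_text_char()`.

```rust
const MOJIBAKE_MARKERS: &[&str] = &["?Identity-H Unimplemented?"];

// Assumption: simplified stand-in for the real predicate, which also
// accepts Hangul, CJK, and Latin-1 ranges.
fn is_valid_text_char(c: char) -> bool {
    ('\u{0020}'..='\u{007E}').contains(&c)
}

// Shared counting helper: valid chars / total chars, 0.0 for empty input.
fn valid_ratio(s: &str) -> f32 {
    let total = s.chars().count();
    if total == 0 {
        return 0.0;
    }
    let valid = s.chars().filter(|&c| is_valid_text_char(c)).count();
    valid as f32 / total as f32
}

fn compute_valid_char_ratio_sketch(s: &str) -> f32 {
    let mut cleaned = s.to_string();
    let mut had_marker = false;
    for &marker in MOJIBAKE_MARKERS {
        if cleaned.contains(marker) {
            had_marker = true;
            cleaned = cleaned.replace(marker, "");
        }
    }
    // No marker: behave exactly like the plain ratio.
    if !had_marker {
        return valid_ratio(&cleaned);
    }
    // Marker-only page (possibly with whitespace between markers).
    if cleaned.trim().is_empty() {
        return 0.0;
    }
    // Marker-dominant page: cap below the 0.5 OCR threshold.
    let stripped = s.len().saturating_sub(cleaned.len());
    let raw = valid_ratio(&cleaned);
    if stripped > cleaned.len() { raw.min(0.3) } else { raw }
}
```

With this sketch, a string of three markers returns 0.0, two markers separated by whitespace also return 0.0, and marker-free ASCII text still returns 1.0, which matches the invariants listed above.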
### §4.2 Bug #6: Unit tests
Replace existing Bug #6 test set with two new tests reflecting marker-dominance heuristic:
```rust
#[test]
fn identity_h_marker_dominance_caps_ratio_below_threshold() {
    // metro-korea.pdf-class: 20× marker (520 char) + 12 char ASCII header
    // ("Page 1 of 5" plus the separator space).
    // Without dominance heuristic: ratio = 12/12 = 1.0 (bypasses OCR).
    // With dominance heuristic: ratio ≤ 0.3 (triggers OCR fallback).
    let s = format!("Page 1 of 5 {}", "?Identity-H Unimplemented?".repeat(20));
    let r = compute_valid_char_ratio(&s);
    assert!(r <= 0.3, "marker-dominant mixed page → ratio ≤ 0.3 (OCR fallback); got {r}");
}

#[test]
fn identity_h_marker_minority_with_long_valid_text_keeps_high_ratio() {
    // Inverse case: short marker noise + long valid text → ratio stays high
    // (no false OCR trigger on otherwise-good pages).
    let header = "x".repeat(200); // 200 char valid ASCII
    let s = format!("{header} ?Identity-H Unimplemented?"); // 1× marker = 26 char
    let r = compute_valid_char_ratio(&s);
    assert!(r > 0.9, "marker-minority page keeps high ratio; got {r}");
}
```
**Regression preservation**: Existing 8 tests (`empty_string_zero`, `pure_ascii_one`, `pure_hangul_syllables_one`, `pure_pua_zero`, `mixed_half`, `cjk_ideograph_valid`, `hangul_jamo_valid`, `f4_fixture_ratio_under_threshold`) must all remain green.
### §4.3 Bug #7: CLI doc-comment diff
**File**: `crates/kebab-cli/src/main.rs` (SearchArgs field, lines ~150-160)
**Change**:
```diff
-/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `other`. Unknown values match nothing
+/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `code`, `other`. Unknown values match nothing
```
### §4.3a Bug #7 integration: `integrations/claude-code/kebab/SKILL.md:57` simultaneous update
Per CLAUDE.md §Wire schema v1 invariant — in-tree integration docs must be synchronized when wire surface changes. This round has no wire schema change, but SKILL.md line 57 exhibits the same regression as §4.3 (Bug #7):
**File**: `integrations/claude-code/kebab/SKILL.md` (line 57)
**Change**:
```diff
-`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"other"`; alias `"md"` → `"markdown"`)
+`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"code"` | `"other"`; alias `"md"` → `"markdown"`)
```
### §4.4 Bug #7: CLI help assertion
Add test to `crates/kebab-cli/tests/` (or extend existing help snapshot test):
```rust
#[test]
fn search_help_lists_code_in_media_values() {
    let out = std::process::Command::new(env!("CARGO_BIN_EXE_kebab"))
        .args(["search", "--help"])
        .output()
        .expect("kebab search --help");
    let stdout = String::from_utf8_lossy(&out.stdout);
    assert!(stdout.contains("`code`"), "search --help must list 'code' as accepted --media value");
}
```
### §4.5 Version cascade impact (CLAUDE.md §Versioning cascade)
- **parser_version**: `"pdf-text-v1"` — unchanged. Text-detect threshold is internal metric, not surface.
- **chunker_version**: `"pdf-page-v1.1"` — unchanged (no chunker logic affected).
- **wire schema**: No new fields, no schema version bump. `compute_valid_char_ratio()` is internal to `PdfTextExtractor::extract()`.
---
## §5 Acceptance criteria
- [ ] Text_quality unit test: `identity_h_marker_dominance_caps_ratio_below_threshold` passes.
- [ ] Text_quality unit test: `identity_h_marker_minority_with_long_valid_text_keeps_high_ratio` passes.
- [ ] Regression: All 8 existing text_quality tests remain green (no ratio behavior changes for valid text).
- [ ] CLI help assertion: `cargo test search_help_lists_code_in_media_values` passes.
- [ ] SKILL.md integration: `grep -F '"code"' integrations/claude-code/kebab/SKILL.md` returns ≥1 line.
- [ ] Full workspace test suite: `cargo test --workspace --no-fail-fast -j 1` green (clippy + unit + integration).
- [ ] Fresh binary builds: `CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build --release -p kebab-cli` succeeds.
---
## §6 Risks + open questions
### Identified risks
**R-1 — Other lopdf unimplemented markers** (resolved per critic round 1 probe): lopdf 0.32.0 emits exactly one marker — `?Identity-H Unimplemented?` at `lopdf-0.32.0/src/document.rs:523` (`Document::decode_text`). Other CMap encoding arms (`UniCNS`, `UniJIS`, `UniKS`, `GBK-EUC`, `Adobe-*`) fall through to `String::from_utf8_lossy(bytes)` → PUA / replacement-char territory (already covered by `pure_pua_zero` test). Marker array adequacy = OK for current lopdf pin. **Re-verify on lopdf dependency upgrade.**
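The fall-through behavior can be illustrated with the standard library alone. The sketch below assumes an ASCII-printable-only stand-in for the validity predicate and uses illustrative byte values that are not valid UTF-8; `lossy_printable_ratio` is a hypothetical helper, not part of lopdf or this crate.

```rust
// Assumption: ASCII-printable-only stand-in for the real validity predicate.
fn is_ascii_printable(c: char) -> bool {
    ('\u{0020}'..='\u{007E}').contains(&c)
}

/// Decodes bytes with lossy UTF-8 conversion (invalid sequences become
/// U+FFFD) and returns the share of ASCII printable chars in the result.
fn lossy_printable_ratio(bytes: &[u8]) -> f32 {
    let text = String::from_utf8_lossy(bytes);
    let total = text.chars().count();
    if total == 0 {
        return 0.0;
    }
    let valid = text.chars().filter(|&c| is_ascii_printable(c)).count();
    valid as f32 / total as f32
}
```

Bytes such as `[0xB0, 0xA1, 0xB3, 0xAA]` decode to replacement characters only, so the ratio is 0.0 and the page already falls below the OCR threshold without any marker handling.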
**R-2 — Whitespace-only edge case after stripping**: Handled by `.trim().is_empty()` check; returns 0.0 as intended.
**R-3 — Version/wire schema impact**: None. text_quality is internal threshold metric, not exposed to wire schema or version cascade.
**R-4 — Other `--media` help locations** (revised per critic): the `--media` value list is spread across 3 surfaces: `crates/kebab-cli/src/main.rs:157-159` (CLI doc-comment, covered by §4.3), `integrations/claude-code/kebab/SKILL.md:57` (skill doc, covered by §4.3a), and `crates/kebab-cli/tests/cli_help_smoke.rs` (test, covered by §4.4). Plan drafter to run `grep -r '\bmedia\b' integrations/ crates/kebab-cli/src docs/wire-schema/v1` to confirm no additional surfaces exist.
**R-5 — Bulk mode media field parsing**: `crates/kebab-app/src/bulk.rs:161` handles media field parsing independently; string doc-comment update does not affect functional correctness.
### Open questions
**OQ-1 — Marker case sensitivity**: Does lopdf always emit markers in exact case `?Identity-H Unimplemented?`? Verify with lopdf source. If case variations exist, use case-insensitive matching or extend array.
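If case variants do turn out to exist, one option is ASCII case-insensitive stripping. The following is only a sketch of that option, not a decision of this spec; `strip_marker_ignore_ascii_case` is a hypothetical helper name.

```rust
const MARKER_LOWER: &str = "?identity-h unimplemented?";

/// Removes every ASCII case-insensitive occurrence of the marker.
/// `to_ascii_lowercase` changes only ASCII letters, so byte offsets in the
/// lowered copy line up with the original string.
fn strip_marker_ignore_ascii_case(s: &str) -> String {
    let lowered = s.to_ascii_lowercase();
    let mut out = String::with_capacity(s.len());
    let mut pos = 0;
    while let Some(found) = lowered[pos..].find(MARKER_LOWER) {
        let start = pos + found;
        // Copy the text before this occurrence, then skip the marker.
        out.push_str(&s[pos..start]);
        pos = start + MARKER_LOWER.len();
    }
    // Copy whatever follows the last occurrence.
    out.push_str(&s[pos..]);
    out
}
```

Non-ASCII text around the marker is copied through unchanged, so Hangul or CJK content on the same page is not affected by the lowercasing step.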
**OQ-2 — Marker stripping threshold policy** (resolved via §4.1 marker-dominance heuristic): When stripped characters exceed remaining characters, the ratio is capped at 0.3 to trigger OCR fallback. This ensures marker-dominant mixed pages (e.g., 99% marker + 1% valid header) do not bypass OCR even though the header alone scores a ratio of 1.0. Design intent (parent spec §1.3) is upheld for marker-only and marker-dominant pages; marker-minority pages keep their high ratio by design (§4.2).
**OQ-3 — Alias expansion scope**: Bug #7 explicitly omits new aliases (e.g., `src` → `code`). Single additive fix to doc-comment, no enum variant changes.
### UX consequence — pre-bugfix2 v0.20 users must run `--force-reingest`
This round preserves version cascade (no `parser_version` bump). The `try_skip_unchanged` path will match files indexed pre-bugfix2 with same `parser_version="pdf-text-v1"` + hash. Pre-indexed `metro-korea.pdf`-class pages will NOT automatically re-route through the corrected text-detect → OCR fallback.
**User action required**: Explicit `kebab ingest --force-reingest <workspace>` to purge cached skip decisions and re-process affected files.
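Why unchanged files are skipped can be pictured with a small model of the skip decision. Everything in this sketch is hypothetical: `IndexedRecord`, its fields, and `should_skip` are illustrative names, and the real `try_skip_unchanged` path may consult more state than a version string and a hash.

```rust
/// Hypothetical record of a previously indexed file (illustrative fields).
struct IndexedRecord {
    parser_version: String,
    content_hash: String,
}

/// Hypothetical skip check: a file is skipped when its parser version and
/// content hash both match the stored record, unless re-ingest is forced.
fn should_skip(
    prev: Option<&IndexedRecord>,
    parser_version: &str,
    content_hash: &str,
    force: bool,
) -> bool {
    if force {
        return false;
    }
    match prev {
        Some(r) => r.parser_version == parser_version && r.content_hash == content_hash,
        None => false,
    }
}
```

Because this round keeps `parser_version` at `"pdf-text-v1"`, a file indexed before the fix matches on both keys and is skipped; only the forced path reaches the corrected text detection.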
**Release notes** (v0.20.1 or whichever version ships this bugfix) **MUST include**: "If you indexed mojibake-heavy PDFs (esp. metro-korea.pdf class) on v0.20.0 pre-bugfix2, run `kebab ingest --force-reingest <workspace>` to apply the improved text detection. Otherwise, `ingest` will skip unchanged files and OCR fallback will not trigger." + link to design §9 cascade explanation.
**Documentation updates** (same PR as code): README + HANDOFF + ARCHITECTURE per `feedback_readme_sync_rule` memory — mention the `--force-reingest` step in release highlights or migration notes.
Deliberate design: automatic migration risks wedging stable v0.20.0 KBs. Manual `--force-reingest` is the correct escape hatch (precedent: parent spec §1.7, lines 126-128).
---
## §7 References
- **Parent spec**: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md §1.3 (line 74), §1.4, §9
- **Dogfood evidence**: .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md §Bug #6, §Bug #7
- **Critic result**: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md (findings H-1 through NIT-2, parent invariant audit)
- **External source**: lopdf-0.32.0/src/document.rs:523 (`Document::decode_text` — sole emitter of `?Identity-H Unimplemented?` marker)
- **Code locations**:
- text_quality.rs: `crates/kebab-parse-pdf/src/text_quality.rs:9-106`
- CLI help: `crates/kebab-cli/src/main.rs:157-159`
- Skill integration: `integrations/claude-code/kebab/SKILL.md:57`
- CLI test: `crates/kebab-cli/tests/` (search_help_lists_code_in_media_values)
---
**Status**: Round 1c rewrite COMPLETE. All 9 critic findings (H-1 + M-1/M-2/M-3 + L-1/L-2 + NIT-1/NIT-2 + invariant audit) applied in-session.
**Prior round reference**: Round 1 commits (d9acda5, 436fd01, 241ded5, e674ff4) are merged on branch; this round is independent (text_quality.rs vs. source-fs/connector.rs + chunk/pdf_page_v1.rs).