Deliverables of the 3-round dogfood-driven fix cycle:

- bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 lines + plan 388 lines
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 lines + plan 1043 lines
- docs/DOGFOOD.md: the complete all-round dogfood checklist (§0 environment through §13 reference corpus)

Each round's spec/plan was frozen after the critic + verifier round 2 closure ACCEPT. Based on dogfood-driven evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

309 lines
18 KiB
Markdown

---
title: "v0.20.0 sub-item 1 bugfix round 2 — Identity-H mojibake marker + CLI --media help text"
created: 2026-05-27
status: "DRAFT round 1c"
parent_spec: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
contract_sections: ["§1.3 (text-detect threshold metric)", "§9 (version cascade)"]
related_specs:
  - docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md
  - docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
related_dogfood:
  - .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md (Bug #6 + #7)
---

# v0.20.0 sub-item 1 bugfix round 2 — Identity-H mojibake marker + CLI --media help text

## §1 Problem statement

### §1.1 Bug #6: Identity-H Unimplemented marker bypasses mojibake detection

**Symptom**: `metro-korea.pdf` (58 MB, Identity-H CID font without ToUnicode CMap) ingests with `pdf_ocr_pages=0`. The full text contains the `?Identity-H Unimplemented?` marker 1154 times. All 21 pages + 34 chunks are indexable, but the content is unusable: a repeated marker literal instead of readable text.

**Root cause**: `crates/kebab-parse-pdf/src/text_quality.rs` lines 9-37. The function `compute_valid_char_ratio()`, via `is_valid_text_char()`, treats the ASCII printable range `0x0020..=0x007E` as unconditionally valid. lopdf emits `?Identity-H Unimplemented?` (26 ASCII printable chars) when it cannot decode a CID font lacking a ToUnicode CMap. Result: valid_ratio = 1.0 → exceeds the OCR fallback threshold 0.5 → the text-detect first pass incorrectly classifies mojibake as valid text → `pdf_ocr_pages` stays 0 and no OCR fallback is triggered.

**Design intent deviation**: Parent spec §1.3 (line 74) explicitly states "ratio metric judges mojibake page as scanned candidate." The PoC example "֥ᬵᯝe ࠦᯱᖝ░" (custom font, no ToUnicode) should trigger OCR. **Implementation gap**: the literal ASCII marker case (Identity-H font) was not anticipated.

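
The root cause can be reproduced in isolation. The following is a minimal sketch that models only the ASCII-printable arm described above; `is_ascii_printable` and `naive_ratio` are hypothetical stand-ins, not the real `is_valid_text_char()` or `compute_valid_char_ratio()`, which also handle Hangul, CJK, and Latin-1 ranges.

```rust
/// Hypothetical stand-in for the ASCII arm of `is_valid_text_char()`.
fn is_ascii_printable(c: char) -> bool {
    ('\u{0020}'..='\u{007E}').contains(&c)
}

/// Valid-char ratio with no marker handling (the pre-fix behavior as
/// described in the root cause, restricted to ASCII for illustration).
fn naive_ratio(s: &str) -> f32 {
    let total = s.chars().count();
    if total == 0 {
        return 0.0;
    }
    let valid = s.chars().filter(|&c| is_ascii_printable(c)).count();
    valid as f32 / total as f32
}

fn main() {
    let marker = "?Identity-H Unimplemented?";
    // Every one of the marker's 26 chars is ASCII printable...
    assert_eq!(marker.chars().count(), 26);
    // ...so text made only of markers scores a perfect ratio and never
    // drops below the 0.5 OCR fallback threshold.
    let text = marker.repeat(1154);
    assert_eq!(naive_ratio(&text), 1.0);
}
```
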
### §1.2 Bug #7: CLI `--media` help text omits `code` from valid values

**Symptom**: `kebab search --help` lists the accepted `--media` values as "markdown, pdf, image, audio, other"; `code` is missing.

**Actual behavior**: `kebab search "main" --media code --json -k 5` returns 5 hits (code/script.sh, code/rust_sample/src/main.rs, etc.). Schema `media_breakdown` includes `code: 6` as a first-class value. Functional correctness is complete; **only the CLI doc-comment is outdated**.

**Root cause**: `crates/kebab-cli/src/main.rs:148-165`. The clap doc-comment on the SearchArgs `--media` field omits `code`. clap's `--help` renderer quotes this doc-comment directly.

---

## §2 Scope + non-scope

### §2.1 Included in this spec

- **Bug #6 fix**: Add known mojibake marker stripping to `compute_valid_char_ratio()`.
- **Bug #6 test**: Two new unit tests covering the Identity-H marker (marker-dominant and marker-minority mixed-text cases; see §4.2).
- **Bug #6 regression**: Verify existing 8 text_quality unit tests remain green.
- **Bug #7 fix**: Update CLI `--media` doc-comment to include `code`.
- **Bug #7 test**: Assert that `kebab search --help` output contains the "`code`" substring.
- **Traceability**: Link both fixes to parent spec §1.3 design intent.

### §2.2 Explicitly out of scope

**Bug #8 candidate (falsified)**: V007 trigram tokenizer already applied; the 2-character query limitation is a design-level constraint, not a bug. Handled in prior dogfood report §Bug #8.

**Non-bug observations**:
- `--readonly + ingest` exit=0: Graceful refusal per CLAUDE.md intent (exit codes 0/1/2/3 unchanged; `error.v1.code` handles agent branching).
- Ask phrasing-sensitive refusal: RAG corner case; not a code defect.
- Binary staleness: Environmental artifact, not applicable to spec.

**Ancillary risks**:
- Scan for other `--media` doc-comment locations (R-4): Plan drafter to use grep; not a blocker for this spec.
- Other lopdf unimplemented markers (R-1): Plan drafter to inspect lopdf source; marker array is extensible.

---

## §3 Decisions

### §3.1 Bug #6: Known mojibake marker stripping

Strip known mojibake marker substrings **before ratio calculation**, then force the ratio to 0.0 if the remainder is empty or whitespace-only after marker removal. When stripped characters exceed remaining characters (marker dominance), cap the ratio at 0.3 to trigger OCR fallback on marker-heavy mixed pages.

**Rationale**: lopdf's unimplemented CID font handling consistently emits specific ASCII marker strings. Hardcoding them is lightweight, deterministic, and covers the known failure mode without requiring expensive heuristics (e.g., ML-based gibberish detection). Pages like those in `metro-korea.pdf` may contain mostly mojibake body text with small valid headers; the marker-dominance check ensures such pages fall below the 0.5 OCR threshold.

**Marker list**: `?Identity-H Unimplemented?` only. lopdf 0.32.0 emits exactly one marker (verified per critic round 1 probe). Extensible if future lopdf versions emit additional markers.

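
The decision above reduces to three branches. The sketch below isolates that arithmetic from the string handling. `adjust_ratio`, `needs_ocr`, `OCR_THRESHOLD`, and `DOMINANCE_CAP` are illustrative names rather than identifiers from the codebase, and the strict `<` comparison against the 0.5 threshold is an assumption.

```rust
const OCR_THRESHOLD: f32 = 0.5; // threshold value quoted in §1.1
const DOMINANCE_CAP: f32 = 0.3; // cap value from this section

/// `stripped`: chars removed as markers.
/// `remaining`: chars left after stripping (a whitespace-only remainder
/// is modeled as 0 in this sketch).
/// `raw_ratio`: valid-char ratio of the remaining text.
fn adjust_ratio(stripped: usize, remaining: usize, raw_ratio: f32) -> f32 {
    if stripped > 0 && remaining == 0 {
        0.0 // marker-only page
    } else if stripped > remaining {
        raw_ratio.min(DOMINANCE_CAP) // marker-dominant mixed page
    } else {
        raw_ratio // marker-free or marker-minority page
    }
}

/// Assumed shape of the fallback decision: OCR when the ratio is below
/// the threshold.
fn needs_ocr(ratio: f32) -> bool {
    ratio < OCR_THRESHOLD
}

fn main() {
    // 20 markers (520 chars) next to a 12-char valid header.
    assert!(needs_ocr(adjust_ratio(520, 12, 1.0)));
    // One marker (26 chars) next to 201 valid chars.
    assert!(!needs_ocr(adjust_ratio(26, 201, 1.0)));
    // Nothing but markers.
    assert_eq!(adjust_ratio(520, 0, 0.0), 0.0);
}
```
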
### §3.2 Bug #7: CLI doc-comment update

Add `code` to the comma-separated list of valid `--media` values in the SearchArgs field's clap doc-comment. Single-line edit; no functional or schema changes.

### §3.3 Parent spec traceability

Both fixes uphold parent spec §1.3:
- Bug #6 ensures mojibake pages (fonts lacking a ToUnicode CMap) trigger OCR fallback per design intent.
- Bug #7 corrects CLI documentation to match the actual schema (first-class `code` media type supported since v0.18.0).

No changes to parser_version, chunker_version, or wire schema.

---

## §4 Implementation specification

### §4.1 Bug #6: text_quality.rs diff

**File**: `crates/kebab-parse-pdf/src/text_quality.rs`

**Change**:

1. Add constant array of known mojibake markers (lines 8–10):

```rust
// Source of truth: lopdf-0.32.0/src/document.rs:523 (Document::decode_text).
// Only one Unimplemented marker is emitted by lopdf 0.32.0; other CMap
// encodings fall through to `String::from_utf8_lossy(bytes)`, which yields
// PUA / replacement-char territory already covered by `pure_pua_zero`.
// Re-verify on lopdf dependency upgrade.
const MOJIBAKE_MARKERS: &[&str] = &[
    "?Identity-H Unimplemented?",
];
```

2. Refactor `compute_valid_char_ratio()` (lines 39–106):

```rust
pub fn compute_valid_char_ratio(s: &str) -> f32 {
    // 1) Strip known mojibake markers before counting valid chars.
    //    Identity-H CID fonts without ToUnicode CMap emit ASCII-only marker
    //    substrings (bypassing PUA detection).
    //    The `had_marker` guard preserves prior behavior for marker-free
    //    input: whitespace-only text still returns its whitespace-validity
    //    ratio (not 0.0). The trim-empty check and the dominance cap only
    //    apply once a marker has actually been stripped.
    let mut cleaned = s.to_string();
    let mut had_marker = false;
    for &marker in MOJIBAKE_MARKERS {
        if cleaned.contains(marker) {
            had_marker = true;
            cleaned = cleaned.replace(marker, "");
        }
    }

    // 2) Whitespace-only cleaned text → 0.0 (marker-only page).
    if had_marker && cleaned.trim().is_empty() {
        return 0.0;
    }

    // 3) Ratio of valid chars over the cleaned text (existing logic).
    let total = cleaned.chars().count();
    if total == 0 {
        return 0.0;
    }
    let valid = cleaned.chars().filter(|&c| is_valid_text_char(c)).count();
    let ratio = valid as f32 / total as f32;

    // 4) Marker-dominance heuristic: when stripped chars exceed remaining
    //    chars (i.e. markers are > 50% of the original), the page is "mostly
    //    mojibake with some decodeable page-furniture" (e.g. metro-korea.pdf
    //    has header text in a separate font + body that is Identity-H CID).
    //    Cap the ratio at 0.3 (below the 0.5 OCR threshold) to trigger OCR
    //    fallback (parent spec §1.3 intent). The 0.3 cap (not 0.0) preserves
    //    a small signal that some text WAS decodeable, useful for downstream
    //    metrics if ever exposed.
    //    Counts use `chars().count()`, not `len()`: `len()` is in bytes and
    //    would inflate a Hangul/CJK remainder roughly threefold.
    if had_marker {
        let stripped_chars = s.chars().count().saturating_sub(total);
        if stripped_chars > total {
            return ratio.min(0.3);
        }
    }

    ratio
}
```

**Invariants preserved**:
- Function signature and return type unchanged (→ byte-identical caller surface).
- Existing character category logic (hangul, CJK, Latin-1) unmodified.
- Empty-string behavior (return 0.0) preserved.

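
One detail is worth keeping in mind when reading the dominance check: the rule is stated in characters, whereas Rust's `str::len()` returns UTF-8 bytes. The marker is pure ASCII, so its byte and char counts agree, but a Hangul remainder takes three bytes per syllable, and the two measures can disagree. A small illustration follows; the header string is made up, and `dominant_by_bytes` / `dominant_by_chars` are hypothetical helpers.

```rust
/// Dominance test using byte lengths (`str::len`).
fn dominant_by_bytes(original: &str, cleaned: &str) -> bool {
    original.len().saturating_sub(cleaned.len()) > cleaned.len()
}

/// Dominance test using Unicode scalar counts (`chars().count()`).
fn dominant_by_chars(original: &str, cleaned: &str) -> bool {
    let remaining = cleaned.chars().count();
    original.chars().count().saturating_sub(remaining) > remaining
}

fn main() {
    let marker = "?Identity-H Unimplemented?";
    let header = "서울교통공사 노선도"; // hypothetical decodable header
    let page = format!("{header}{marker}");
    let cleaned = page.replace(marker, "");

    assert_eq!(cleaned, header);
    assert_eq!(header.chars().count(), 10); // 9 syllables + 1 space
    assert_eq!(header.len(), 28); // 9 * 3 bytes + 1 byte

    // 26 stripped chars vs 10 remaining chars: marker dominates.
    assert!(dominant_by_chars(&page, &cleaned));
    // 26 stripped bytes vs 28 remaining bytes: dominance is missed.
    assert!(!dominant_by_bytes(&page, &cleaned));
}
```
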
### §4.2 Bug #6: Unit tests

Replace the existing Bug #6 test set with two new tests reflecting the marker-dominance heuristic:

```rust
#[test]
fn identity_h_marker_dominance_caps_ratio_below_threshold() {
    // metro-korea.pdf-class: 20× marker (520 chars) + 11-char ASCII header
    // followed by one separator space (12 chars remain after stripping).
    // Without dominance heuristic: ratio = 12/12 = 1.0 (bypasses OCR).
    // With dominance heuristic: ratio ≤ 0.3 (triggers OCR fallback).
    let s = format!("Page 1 of 5 {}", "?Identity-H Unimplemented?".repeat(20));
    let r = compute_valid_char_ratio(&s);
    assert!(r <= 0.3, "marker-dominant mixed page → ratio ≤ 0.3 (OCR fallback); got {r}");
}

#[test]
fn identity_h_marker_minority_with_long_valid_text_keeps_high_ratio() {
    // Inverse case: short marker noise + long valid text → ratio stays high
    // (no false OCR trigger on otherwise-good pages).
    let header = "x".repeat(200); // 200 chars of valid ASCII
    let s = format!("{header} ?Identity-H Unimplemented?"); // 1× marker = 26 chars
    let r = compute_valid_char_ratio(&s);
    assert!(r > 0.9, "marker-minority page keeps high ratio; got {r}");
}
```

**Regression preservation**: Existing 8 tests (`empty_string_zero`, `pure_ascii_one`, `pure_hangul_syllables_one`, `pure_pua_zero`, `mixed_half`, `cjk_ideograph_valid`, `hangul_jamo_valid`, `f4_fixture_ratio_under_threshold`) must all remain green.

### §4.3 Bug #7: CLI doc-comment diff

**File**: `crates/kebab-cli/src/main.rs` (SearchArgs field, lines ~150–160)

**Change**:
```diff
-/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `other`. Unknown values match nothing
+/// p9-fb-36: filter by `assets.media_type` kind. Comma-separated. Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`, `image`, `audio`, `code`, `other`. Unknown values match nothing
```

### §4.3a Bug #7 integration: `integrations/claude-code/kebab/SKILL.md:57` simultaneous update

Per CLAUDE.md §Wire schema v1 invariant, in-tree integration docs must be synchronized when the wire surface changes. This round has no wire schema change, but SKILL.md line 57 exhibits the same regression as §4.3 (Bug #7):

**File**: `integrations/claude-code/kebab/SKILL.md` (line 57)

**Change**:
```diff
-`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"other"`; alias `"md"` → `"markdown"`)
+`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"code"` | `"other"`; alias `"md"` → `"markdown"`)
```

### §4.4 Bug #7: CLI help assertion

Add test to `crates/kebab-cli/tests/` (or extend existing help snapshot test):

```rust
#[test]
fn search_help_lists_code_in_media_values() {
    let out = std::process::Command::new(env!("CARGO_BIN_EXE_kebab"))
        .args(["search", "--help"])
        .output()
        .expect("kebab search --help");
    let stdout = String::from_utf8_lossy(&out.stdout);
    assert!(stdout.contains("`code`"), "search --help must list 'code' as accepted --media value");
}
```

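
The assertion in §4.4 checks only `code`. If guarding the whole value list is desirable, the check could be factored into a pure function that is testable without spawning the binary. This is a sketch only: `missing_media_values` and `EXPECTED_MEDIA` are hypothetical names, and the value list is copied from the §4.3 doc-comment.

```rust
/// Values the `--media` doc-comment is expected to mention (per §4.3).
const EXPECTED_MEDIA: &[&str] = &["markdown", "pdf", "image", "audio", "code", "other"];

/// Returns every expected value that does not appear backtick-quoted in
/// the given help text.
fn missing_media_values<'a>(help: &str, expected: &[&'a str]) -> Vec<&'a str> {
    expected
        .iter()
        .copied()
        .filter(|value| {
            let quoted = format!("`{value}`");
            !help.contains(quoted.as_str())
        })
        .collect()
}

fn main() {
    let before = "Other accepted: `markdown`, `pdf`, `image`, `audio`, `other`.";
    let after = "Other accepted: `markdown`, `pdf`, `image`, `audio`, `code`, `other`.";

    // The pre-fix text is missing exactly one value.
    assert_eq!(missing_media_values(before, EXPECTED_MEDIA), vec!["code"]);
    // The post-fix text covers the full list.
    assert!(missing_media_values(after, EXPECTED_MEDIA).is_empty());
}
```
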
### §4.5 Version cascade impact (CLAUDE.md §Versioning cascade)

- **parser_version**: `"pdf-text-v1"` — unchanged. Text-detect threshold is internal metric, not surface.
- **chunker_version**: `"pdf-page-v1.1"` — unchanged (no chunker logic affected).
- **wire schema**: No new fields, no schema version bump. `compute_valid_char_ratio()` is internal to `PdfTextExtractor::extract()`.

---

## §5 Acceptance criteria

- [ ] Text_quality unit test: `identity_h_marker_dominance_caps_ratio_below_threshold` passes.
- [ ] Text_quality unit test: `identity_h_marker_minority_with_long_valid_text_keeps_high_ratio` passes.
- [ ] Regression: All 8 existing text_quality tests remain green (no ratio behavior changes for valid text).
- [ ] CLI help assertion: `cargo test search_help_lists_code_in_media_values` passes.
- [ ] SKILL.md integration: `grep -F '"code"' integrations/claude-code/kebab/SKILL.md` returns ≥1 line.
- [ ] Full workspace test suite: `cargo test --workspace --no-fail-fast -j 1` green (unit + integration). `cargo test` does not run lints, so clippy must be green via a separate `cargo clippy` run.
- [ ] Fresh binary builds: `CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build --release -p kebab-cli` succeeds.

---

## §6 Risks + open questions

### Identified risks

**R-1 — Other lopdf unimplemented markers** (resolved per critic round 1 probe): lopdf 0.32.0 emits exactly one marker, `?Identity-H Unimplemented?`, at `lopdf-0.32.0/src/document.rs:523` (`Document::decode_text`). Other CMap encoding arms (`UniCNS`, `UniJIS`, `UniKS`, `GBK-EUC`, `Adobe-*`) fall through to `String::from_utf8_lossy(bytes)` → PUA / replacement-char territory (already covered by the `pure_pua_zero` test). Marker array adequacy = OK for the current lopdf pin. **Re-verify on lopdf dependency upgrade.**

**R-2 — Whitespace-only edge case after stripping**: Handled by the `.trim().is_empty()` check; returns 0.0 as intended.

**R-3 — Version/wire schema impact**: None. text_quality is an internal threshold metric, not exposed to the wire schema or version cascade.

**R-4 — Other `--media` help locations** (revised per critic): The `--media` value list is scattered across 3 surfaces: `crates/kebab-cli/src/main.rs:157–159` (CLI doc-comment, covered by §4.3), `integrations/claude-code/kebab/SKILL.md:57` (skill doc, covered by §4.3a), `crates/kebab-cli/tests/cli_help_smoke.rs` (test, covered by §4.4). Plan drafter to run `grep -rn '\bmedia\b' integrations/ crates/kebab-cli/src docs/wire-schema/v1` (recursive, since the arguments are directories) to confirm no additional surfaces exist.

**R-5 — Bulk mode media field parsing**: `crates/kebab-app/src/bulk.rs:161` handles media field parsing independently; the doc-comment string update does not affect functional correctness.

### Open questions

**OQ-1 — Marker case sensitivity**: Does lopdf always emit markers in the exact case `?Identity-H Unimplemented?`? Verify with lopdf source. If case variations exist, use case-insensitive matching or extend the array.

**OQ-2 — Marker stripping threshold policy** (resolved via §4.1 marker-dominance heuristic): When stripped characters exceed remaining characters, the ratio is capped at 0.3 to trigger OCR fallback. This ensures marker-dominant mixed pages (e.g., 99% marker + 1% valid header) do not bypass OCR despite the header's high ratio. Design intent (parent spec §1.3) is upheld: marker-only and marker-dominant pages trigger OCR fallback, while marker-minority pages keep their computed ratio.

**OQ-3 — Alias expansion scope**: Bug #7 explicitly omits new aliases (e.g., `src` → `code`). Single additive fix to doc-comment, no enum variant changes.

### UX consequence — pre-bugfix2 v0.20 user's `--force` re-ingest

This round preserves version cascade (no `parser_version` bump). The `try_skip_unchanged` path will match files indexed pre-bugfix2 with same `parser_version="pdf-text-v1"` + hash. Pre-indexed `metro-korea.pdf`-class pages will NOT automatically re-route through the corrected text-detect → OCR fallback.

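
The skip behavior described above can be pictured as follows. This is only a sketch of the stated matching rule: `IndexedFile`, `should_skip`, and the `force` parameter are hypothetical and do not mirror the real `try_skip_unchanged` signature.

```rust
/// Hypothetical record of what was stored at the previous ingest.
struct IndexedFile {
    parser_version: String,
    content_hash: String,
}

/// A file is skipped when a prior record exists with the same parser
/// version and content hash, unless the caller forces re-processing.
fn should_skip(
    prev: Option<&IndexedFile>,
    parser_version: &str,
    content_hash: &str,
    force: bool,
) -> bool {
    if force {
        return false;
    }
    matches!(
        prev,
        Some(p) if p.parser_version == parser_version && p.content_hash == content_hash
    )
}

fn main() {
    // Indexed before the bugfix; the placeholder hash stands for the
    // unchanged file contents.
    let prev = IndexedFile {
        parser_version: "pdf-text-v1".to_string(),
        content_hash: "abc123".to_string(),
    };

    // Same parser_version + same hash: the fixed text-detect never runs.
    assert!(should_skip(Some(&prev), "pdf-text-v1", "abc123", false));
    // Forcing bypasses the match and re-processes the file.
    assert!(!should_skip(Some(&prev), "pdf-text-v1", "abc123", true));
    // A never-indexed file is always processed.
    assert!(!should_skip(None, "pdf-text-v1", "abc123", false));
}
```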

**User action required**: Explicit `kebab ingest --force-reingest <workspace>` to purge cached skip decisions and re-process affected files.

**Release notes** (v0.20.1 or whichever version ships this bugfix) **MUST include**: "If you indexed mojibake-heavy PDFs (esp. metro-korea.pdf class) on v0.20.0 pre-bugfix2, run `kebab ingest --force-reingest <workspace>` to apply the improved text detection. Otherwise, `ingest` will skip unchanged files and OCR fallback will not trigger." + link to design §9 cascade explanation.

**Documentation updates** (same PR as code): README + HANDOFF + ARCHITECTURE per `feedback_readme_sync_rule` memory — mention the `--force-reingest` step in release highlights or migration notes.

Deliberate design: automatic migration risks wedging stable v0.20.0 KBs. Manual `--force-reingest` is the correct escape hatch (parent spec §1.7 line 126–128 precedent).

---

## §7 References

- **Parent spec**: docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md §1.3 (line 74), §1.4, §9
- **Dogfood evidence**: .omc/reviews/2026-05-27-v0.20-bugfix-dogfood-report.md §Bug #6, §Bug #7
- **Critic result**: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md (findings H-1 through NIT-2, parent invariant audit)
- **External source**: lopdf-0.32.0/src/document.rs:523 (`Document::decode_text` — sole emitter of `?Identity-H Unimplemented?` marker)
- **Code locations**:
  - text_quality.rs: `crates/kebab-parse-pdf/src/text_quality.rs:9-106`
  - CLI help: `crates/kebab-cli/src/main.rs:157–159`
  - Skill integration: `integrations/claude-code/kebab/SKILL.md:57`
  - CLI test: `crates/kebab-cli/tests/` (search_help_lists_code_in_media_values)

---

**Status**: Round 1c rewrite COMPLETE. All 9 critic findings (H-1 + M-1/M-2/M-3 + L-1/L-2 + NIT-1/NIT-2 + invariant audit) applied in-session.

**Prior round reference**: Round 1 commits (d9acda5, 436fd01, 241ded5, e674ff4) are merged on branch; this round is independent (text_quality.rs vs. source-fs/connector.rs + chunk/pdf_page_v1.rs).