docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md
Artifacts of a 3-round dogfood-driven fix cycle:

- bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 lines + plan 388 lines
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 lines + plan 1043 lines
- docs/DOGFOOD.md: the full end-to-end dogfood checklist (§0 environment through §13 reference corpus)

Each round's spec and plan were frozen after the critic + verifier round 2 closure returned ACCEPT. Based on dogfood-driven evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
849
docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md
Normal file
@@ -0,0 +1,849 @@
---
title: v0.20.0 sub-item 1 bugfix — implementation plan
created: 2026-05-27
status: ACCEPT (round 2 closure — Phase B complete)
target_version: 0.20.0 (PR #189 force-update)
spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md
contract_sections: ["§9 (chunker_version cascade)"]
parent_plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md
review_history:
  - 2026-05-27 plan round 0 (opus, drafter) — 5 step group A-E, 18 sub-action
  - 2026-05-27 plan round 1 critic (opus, thorough) — NEEDS_DISCUSSION, HIGH 1 + MEDIUM 2 + LOW 3 + NIT 1 (7 finding)
  - 2026-05-27 plan round 1 verifier (opus, thorough) — NEEDS_DISCUSSION, HIGH 4 + MEDIUM 3 + LOW 4 + NIT 3 (14 finding)
  - 2026-05-27 plan round 1c rewrite (opus, drafter) — all 21 findings applied (critic 7 + verifier 14). detail = §8 round 1c rewrite changelog
  - 2026-05-27 plan round 2 closure critic (opus) — ACCEPT, 21/21 applied + 4 NIT cosmetic
---

# v0.20.0 sub-item 1 bugfix plan

> Step decomposition of the ACCEPTed spec (`docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`, 965 lines, round 2 closure). It covers the force-update path for 3 bugs (#2 walker code limit / #3 chunk_id collision, Critical / #4 F4 fixture Pages tree): fix commits are stacked on top of the PR #189 base branch `feat/pdf-scanned-ocr`. **5 steps (Group A-E), 18 sub-actions, 4 commits + 1 verify-only step (= 5 steps total, 4 commit boundaries).** The 16-row acceptance table in spec §6 maps 1:1 onto the §4 verifier checklist of this plan.

## §0 Pre-flight + branch state

- **Branch**: `feat/pdf-scanned-ocr` (PR #189 base, HEAD = `b4d9e60` "chore(release): bump version 0.19.0 → 0.20.0"). Per user memory `feedback_pr_workflow`, this is a force-update path: add fix commits on the same branch, then force-push PR #189.
- **Working dir**: `/home/altair823/kebab`.
- **Mandatory env** (`~/.claude/CLAUDE.md`, section "Disk Layout: protecting the root disk is the top priority"):
  - `export CARGO_TARGET_DIR=/build/out/cargo-target/target`: isolates builds on the dedicated 4 TB XFS disk and prevents a `target/` directory from being created at the repo root.
  - `export RELEASE_BIN="${CARGO_TARGET_DIR:-target}/release/kebab"`: release binary alias, used by every acceptance command in the Step 5 dogfood.
  - `export TMPDIR=/build/cache/tmp`: keeps large temporary files off the root disk.
- **Serialized cargo builds** (memory `feedback_serial_build_only`):
  - per-crate: `-j 4` by default (e.g. `cargo test -p kebab-chunk -j 4`).
  - workspace: `-j 1` is mandatory (`cargo test --workspace -j 1`, `cargo clippy --workspace -j 1`); linking 18 integration-test binaries concurrently causes OOM.
- **`target/` clean policy** (memory `feedback_cargo_clean_policy`): because `/build` is a separate 4 TB XFS disk, routine cleaning is prohibited. Clean only when `df -h /build` reports `Avail < 500G` OR `du -sh $CARGO_TARGET_DIR` exceeds 200G. Run this conditional check once, right before the first cargo invocation in Step 5 E1; if neither threshold is reached, skip the clean and record one line in the commit body: "skipped cargo clean — /build avail X TB".
- **Assumed dogfood KB layout** (Step 5 E3 prerequisite, critic round 1 H-1 closure): the canonical config path is `/build/cache/tmp/v0.20-dogfood/config.toml` (in place, inside the KB dir). The external backup file `/build/cache/tmp/v0.20-dogfood-config.toml` **does not exist**; every acceptance command in this plan is based on the in-place config. The Step 5 E3 KB clean **must not use a destructive `rm -rf`**; use a selective clean that preserves the config (see the E3 detail). The canonical dogfood config path is defined only here in §0, and the Step 5 E3 commands reference it.
- **Impact on HOTFIXES.md / README / HANDOFF / ARCHITECTURE**: Step 2 B4 adds a HOTFIXES.md entry (Bug #3 second-iteration patch). No other user-visible surface changes, so README, HANDOFF, and ARCHITECTURE need no updates (no new CLI flag, wire schema, TUI key, or config; the chunker_version bump is an internal cascade and goes in the release notes only).
- **No wire schema changes**: `ingest_progress.v1` and `ingest_report.v1` gain no fields. No V00X migration. The `chunks` table DDL is unchanged.
- **No frozen design contract changes**: the design §9 cascade rule itself is unchanged (only chunker_version is bumped, as a direct application of the rule).
- **No workspace version bump**: v0.20.0 is already cut (commit `b4d9e60`). This plan is a cumulative bugfix within the same v0.20.0 (PR #189 force-update). Step 5 E5 only force-pushes the PR; the release tag is not re-cut.

## §1 Plan overview + spec linkage

This plan decomposes the fix designs in spec §3 (Bug #2), §4 (Bug #3), and §5 (Bug #4) into atomic steps. Core sequencing:

1. **Bug #2 walker code limit fix** (Step 1): `is_code_file` helper + walker conditional + unit test. Applies the diffs in spec §3.4 and §3.5 as-is. 1 commit.
2. **Bug #3 chunk_id fix + chunker_version bump** (Step 2): extend the `chunk_page` return tuple to a 4-tuple, change the caller's `per_chunk_hash` suffix to `segment_start`, bump `VERSION_LABEL` from `"pdf-page-v1"` to `"pdf-page-v1.1"`, update the module doc, add a HOTFIXES.md entry, and add a unit regression test. Applies the diffs in spec §4.4, §4.4.1, and §4.5 as-is. 1 commit.
3. **Bug #3 integration test** (Step 3): a multi-scanned-PDF, chunk_id-collision-free integration test under `crates/kebab-app/tests/`. The executor's first sub-action is to settle the MockOcrEngine pre-condition from spec §4.5.1 (Option A share, or Option B inline). 1 commit.
4. **Bug #4 F4 fixture re-generation** (Step 4): pikepdf-based rewrite of `tests/fixtures/_synth/mojibake.py`, regeneration of the F4 fixture binary, and 3 new invariant tests in parse-pdf. Applies the diffs in spec §5.4, §5.5, and §5.6 as-is. 1 commit.
5. **Workspace verify + commit + PR force-push** (Step 5): cargo workspace test with `-j 1`, clippy with `-D warnings`, dogfood re-run (isolated KB at `/build/cache/tmp/v0.20-dogfood`, qwen2.5vl:3b on the Ollama endpoint `192.168.0.47:11434`), and PR #189 force-push. The 16-row consolidated acceptance table in spec §6 is this step's verifier checklist.

Ordering invariants:

- **Step 1 || Step 2 || Step 4 are mutually independent**: the 3 fixes are confined to file paths in different crates (`kebab-source-fs` / `kebab-chunk` / `tests/fixtures` + `kebab-parse-pdf`), so they could proceed concurrently. Consistency takes priority, so they still run sequentially: Step 1 → Step 2 → Step 4.
- **Step 2 < Step 3**: the integration test depends on the fixed chunk_id computation path in `kebab-chunk`. Step 2 being GREEN is a prerequisite.
- **Step 4 < Step 5 dogfood**: the regenerated F4 fixture binary is 1 of the 9 dogfood PDFs (mojibake), so it is a prerequisite for verifying the `block_count: 1` invariant in the Step 5 E3 dogfood.
- **Steps 1-4 all < Step 5 workspace test**: the workspace-wide test is only meaningful on the final state of production code and tests.

Commits are made per logical group, 1 commit each (atomic), following the 5-commit table in the §7 sequencing summary. Per user memory `feedback_pr_workflow` (gitea-pr + review loop), the plan enters the review loop of the `gitea-pr-review` skill after the force-update.

---

## §2 Step group structure (Group A-E)

| Step | Group | Category | Sub-actions |
|---:|---|---|---|
| 1 | A | Bug #2 walker code limit fix | A1 `is_code_file` helper + A2 walker conditional + A3 unit test |
| 2 | B | Bug #3 chunk_id collision fix + chunker_version bump | B1 `chunk_page` 4-tuple + B2 caller `per_chunk_hash` + B3 `VERSION_LABEL` bump + B4 module doc + HOTFIXES.md + B5 unit regression test |
| 3 | C | Bug #3 multi-scanned PDF integration test | C1 MockOcrEngine share decision + C2 integration test (conditional) |
| 4 | D | Bug #4 F4 fixture re-generation | D1 mojibake.py pikepdf rewrite + D2 fixture regenerate + commit + D3 parse-pdf 3 invariant test |
| 5 | E | Workspace verify + commit + PR force-push | E1 cargo workspace test -j 1 + E2 clippy -D warnings + E3 dogfood re-run + E4 commit + E5 PR #189 force-push |

---

## §3 Per-step detail

### Step 1 (Group A): Bug #2 walker code limit fix

Option A from spec §3 (code path only): gate the `is_oversized` call behind an `is_code_file(path)` conditional. Size limits for PDF/image/markdown are enforced by the parser stage itself (lopdf `load_mem` works normally at 256 KB+; image OCR self-caps via `max_pixels`).

#### Sub-action A1 — add the `is_code_file` helper

- **Files affected**: `crates/kebab-source-fs/src/code_meta.rs` (right after the `is_oversized` function at line 129, or right after the definition of `code_lang_for_path`).
- **Action** (spec §3.4 diff, applied as-is):
  ```rust
  /// Returns true when `path`'s filename/extension is recognised as a code
  /// file (per `code_lang_for_path`). Used by the walker to apply
  /// `[ingest.code].max_file_bytes` / `max_file_lines` only to code files,
  /// not to PDF/image/markdown (which have their own size controls in
  /// their respective parsers).
  pub(crate) fn is_code_file(path: &Path) -> bool {
      code_lang_for_path(path).is_some()
  }
  ```
- **Acceptance**:
  - `grep -c "pub(crate) fn is_code_file" crates/kebab-source-fs/src/code_meta.rs` = **1**.
  - `cargo build -p kebab-source-fs -j 4` green.

#### Sub-action A2 — walker conditional size check

- **Files affected**: `crates/kebab-source-fs/src/connector.rs:168-190` (currently verified line range).
- **Action** (spec §3.4 diff, applied as-is: short-circuit on `is_code_file` ahead of the `is_oversized` call):
  ```diff
  -        // Size-cap check (byte or line limit).
  -        if crate::code_meta::is_oversized(
  -            &abs_path,
  -            self.max_file_bytes,
  -            self.max_file_lines,
  -        )
  -        .unwrap_or(false)
  +        // v0.20.0 sub-item 1 bugfix (#2): size-cap applies ONLY to
  +        // code files. PDF/image/markdown bypass — their parsers
  +        // have their own size controls. spec §3.3.
  +        if crate::code_meta::is_code_file(&abs_path)
  +            && crate::code_meta::is_oversized(
  +                &abs_path,
  +                self.max_file_bytes,
  +                self.max_file_lines,
  +            )
  +            .unwrap_or(false)
           {
               fs_skips.skipped_size_exceeded =
                   fs_skips.skipped_size_exceeded.saturating_add(1);
               ...
               tracing::debug!(
                   path = %rel_path.display(),
                   max_bytes = self.max_file_bytes,
                   max_lines = self.max_file_lines,
  -                "skip: file exceeds size cap"
  +                "skip: code file exceeds size cap"
               );
               continue;
           }
  ```
- **Acceptance**:
  - `grep -nE "is_code_file\(&abs_path\)\s*$" crates/kebab-source-fs/src/connector.rs` ≥ **1**.
  - `grep -c "skip: code file exceeds size cap" crates/kebab-source-fs/src/connector.rs` ≥ **1**.
  - `cargo build -p kebab-source-fs -j 4` green.

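The gate introduced by A1 + A2 can be sketched in isolation. This is a minimal illustration, not the crate's code: the extension table inside `code_lang_for_path`, the byte-only `is_oversized`, and the `should_skip` wrapper are all assumptions made for the example (the real `is_oversized` also checks a line limit and reads the file).

```rust
use std::path::Path;

/// Hypothetical stand-in for `code_lang_for_path`: maps a few extensions to
/// a language label. The real table lives in `code_meta.rs`.
fn code_lang_for_path(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "rs" => Some("rust"),
        "py" => Some("python"),
        "go" => Some("go"),
        _ => None,
    }
}

/// Same shape as the helper added in A1.
fn is_code_file(path: &Path) -> bool {
    code_lang_for_path(path).is_some()
}

/// Simplified size check on a byte count only (assumption for this sketch).
fn is_oversized(len_bytes: u64, max_file_bytes: u64) -> bool {
    len_bytes > max_file_bytes
}

/// Walker decision: skip only when the file is code AND exceeds the cap.
/// `&&` short-circuits, so non-code files never reach the size check.
fn should_skip(path: &Path, len_bytes: u64, max_file_bytes: u64) -> bool {
    is_code_file(path) && is_oversized(len_bytes, max_file_bytes)
}
```

With the 262_144-byte cap used in A3, a 300 KB `paper.pdf` or `notes.md` passes through while a 300 KB `big.rs` is skipped.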
#### Sub-action A3 — add the Bug #2 unit test

- **Files affected**: the existing `#[cfg(test)] mod tests` in `crates/kebab-source-fs/src/connector.rs` (spec §3.5 says to add to the existing test module, not a new file).
- **Action** (body of the `size_cap_skips_only_code_files` test from spec §3.5, as-is):
  - Synthesize 3 files in a tempdir: a 300 KB PDF, a 300 KB markdown file, and a 300 KB `big.rs`.
  - Call `scan_with_skips(&SourceScope::default())` on a `FsSourceConnector` (`max_file_bytes = 262_144`, `max_file_lines = 5_000`).
  - assertions:
    - `paths.contains("paper.pdf")` (PDF passes the walker).
    - `paths.contains("notes.md")` (Markdown passes the walker).
    - `!paths.contains("big.rs")` (code file is skipped by the walker).
    - `skips.skip_examples.size_exceeded` contains 1 entry for `big.rs` and 0 entries for `paper.pdf`.
  - cfg helper: reuse the existing test module's `cfg_with_size_cap(root, max_bytes, max_lines)` pattern (add the helper if needed).
- **Acceptance**:
  - `cargo test -p kebab-source-fs size_cap_skips_only_code_files -j 4` green.
  - No regression in the existing `ingest_report_counts_oversized_files_by_bytes` (fixture `huge.rs`) and `ingest_report_size_cap_by_line_count` (fixture `longfile.rs`): both fixtures are `.rs` files, so they still pass through the new conditional (invariant preserved).
  - `cargo test -p kebab-source-fs -j 4` all green.

#### Commit (all of Step 1)

```
fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2)

- crates/kebab-source-fs/src/code_meta.rs: add pub(crate) fn is_code_file
- crates/kebab-source-fs/src/connector.rs: walker conditional `is_code_file && is_oversized`
- crates/kebab-source-fs/src/connector.rs mod tests: size_cap_skips_only_code_files unit test
- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §3
```

---

### Step 2 (Group B): Bug #3 chunk_id collision fix + chunker_version bump

Option A from spec §4.3 (use the segment boundary `start` as the `per_chunk_hash` suffix). Extend the `chunk_page` return tuple from the 3-tuple `(actual_start, chunk_end, slice)` to the 4-tuple `(segment_start, actual_start, chunk_end, slice)`, and change the caller's `per_chunk_hash` suffix to `segment_start`. Bump `VERSION_LABEL` from `"pdf-page-v1"` to `"pdf-page-v1.1"` (spec §4.4.1 round 1c M-1 decision: explicit cascade audit trail).

#### Sub-action B1 — `chunk_page` 4-tuple expansion

- **Files affected**: `crates/kebab-chunk/src/pdf_page_v1.rs:200-204` (doc comment) + `:205-289` (signature at line 205 through the closing `}` at line 289). Line ranges corrected per the actual probe results from critic round 1 + verifier round 1 (L-1).
- **Action** (spec §4.4 diff, applied as-is):
  - Update the doc comment: `(char_start, char_end, text_slice)` → `(segment_start, actual_start, chunk_end, text_slice)`:
    ```rust
    /// Split a single page's text into ordered chunks, each represented as
    /// `(segment_start, actual_start, chunk_end, text_slice)`.
    ///
    /// - `segment_start` = pre-overlap segment boundary. Strictly increasing
    ///   across the returned vec. Use this for chunk_id uniqueness suffixes.
    /// - `actual_start` = post-overlap start char index. May collapse to a
    ///   previous chunk's `actual_start` under aggressive overlap policy.
    ///   Use this for `SourceSpan::Page::char_start`.
    /// - `chunk_end` = chunk's end char index (exclusive).
    fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize)
        -> Vec<(usize, usize, usize, String)>
    ```
  - early return: `vec![(0, n, text.to_string())]` → `vec![(0, 0, n, text.to_string())]`.
  - push in the loop body: `chunks.push((actual_start, chunk_end, slice))` → `chunks.push((start, actual_start, chunk_end, slice))`. (`start = bounds[seg_idx]` already exists as a local variable at line 245.)
  - The overlap walk's `let prev_min = prev.0` reads the first field of the old tuple; in the post-fix tuple shape that value is `prev.1` (actual_start). Change it to preserve the spec §4.4 invariant:
    ```diff
    -        let actual_start = if let Some(prev) = chunks.last() {
    -            let prev_min = prev.0;
    +        let actual_start = if let Some(prev) = chunks.last() {
    +            // prev tuple shape = (segment_start, actual_start, chunk_end, slice).
    +            // overlap walk floor = previous chunk's actual_start (prev.1).
    +            let prev_min = prev.1;
                 ...
    ```
- **Acceptance**:
  - `grep -nE "fn chunk_page.*-> Vec<\(usize, usize, usize, String\)>" crates/kebab-chunk/src/pdf_page_v1.rs` = **1**.
  - `grep -c "let prev_min = prev.1" crates/kebab-chunk/src/pdf_page_v1.rs` ≥ **1**.
  - `cargo build -p kebab-chunk -j 4` green (the red state clears once the caller change in sub-action B2 is applied alongside).

#### Sub-action B2 — caller `per_chunk_hash` suffix → `segment_start`

- **Files affected**: `crates/kebab-chunk/src/pdf_page_v1.rs:149-186` (currently verified: the `for (...) in chunk_page(...)` loop in the `chunk` method starts at line 149 and ends at line 186; corrected per verifier round 1 L-2).
- **Action** (spec §4.4 diff, applied as-is):
  ```diff
  -        for (char_start, char_end, slice) in
  -            chunk_page(&p.text, target_bytes, overlap_bytes)
  +        for (segment_start, char_start, char_end, slice) in
  +            chunk_page(&p.text, target_bytes, overlap_bytes)
           {
               ...
               let span = SourceSpan::Page {
                   page: page_num,
                   char_start: Some(char_start_u32),
                   char_end: Some(char_end_u32),
               };
               let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
  -            // Per-chunk policy_hash variant prevents chunk_id
  -            // collision when a page produces multiple chunks. See
  -            // module docs for rationale.
  -            let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
  +            // v0.20.0 sub-item 1 bugfix (#3): per-chunk policy_hash
  +            // variant uses `segment_start` (pre-overlap boundary,
  +            // strictly increasing) instead of `char_start` (post-
  +            // overlap, may collapse to prev_min). See module docs +
  +            // spec §4.1 root cause + HOTFIXES.md 2026-05-27.
  +            let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}");
               let chunk_id =
                   id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
               ...
           }
  ```
  - `SourceSpan::Page.char_start` still keeps the post-overlap `char_start` (= `actual_start`), so citation-locality semantics are preserved.
- **Acceptance** (verifier round 1 M-2: B2 and B4 land in the same logical commit, so the grep is evaluated at Step 2 commit time, i.e. post-B4):
  - `grep -c "#c{segment_start}" crates/kebab-chunk/src/pdf_page_v1.rs` ≥ **1** (with B2 alone = 1 call site; after the B4 module doc update = 2, and the B4 acceptance verifies ≥ 2).
  - `grep -c "#c{char_start}" crates/kebab-chunk/src/pdf_page_v1.rs` = **0** (both the call site and the module doc are switched to segment_start: the consolidated same-commit invariant of B2+B4).
  - When verifying sub-action by sub-action, the B2-only grep for `#c{char_start}` returns ≥ 1 because the literal remains in the module doc at line 56; it settles at 0 once the Step 2 commit boundary is reached.

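Why the suffix has to move from `char_start` to `segment_start` can be seen in a small model of the overlap walk. The sketch below is a simplification under stated assumptions: `chunk_bounds` and `suffix` are hypothetical names, offsets are char indices, each segment becomes exactly one chunk, and the walk is modelled as "step back by the overlap, but never past the previous chunk's actual start". The real `chunk_page` is more involved.

```rust
/// Simplified model: `bounds` are strictly increasing segment boundaries
/// (char indices). Returns (segment_start, actual_start, chunk_end).
fn chunk_bounds(bounds: &[usize], overlap_chars: usize) -> Vec<(usize, usize, usize)> {
    let mut chunks: Vec<(usize, usize, usize)> = Vec::new();
    for window in bounds.windows(2) {
        let (start, end) = (window[0], window[1]);
        let actual_start = match chunks.last() {
            // Walk back by the overlap, floored at the previous chunk's
            // actual_start (the `prev_min` floor, field .1 of the tuple).
            Some(prev) => start.saturating_sub(overlap_chars).max(prev.1),
            None => start,
        };
        chunks.push((start, actual_start, end));
    }
    chunks
}

/// Builds the per-chunk hash input in the `{base}#c{offset}` shape.
fn suffix(base: &str, offset: usize) -> String {
    format!("{base}#c{offset}")
}
```

For boundaries `[0, 12, 512]` and an 80-char overlap, the second chunk's `actual_start` falls back to 0, so both chunks would share `#c0` under the old scheme, while the `segment_start` values 0 and 12 stay distinct.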
#### Sub-action B3 — bump `VERSION_LABEL` `"pdf-page-v1"` → `"pdf-page-v1.1"` + update 2 hardcoded literal sites

- **Files affected** (enumerating the 2 sites found by the verifier round 1 H-1 actual probe `grep -rn '"pdf-page-v1"' crates/ --include='*.rs'`):
  - `crates/kebab-chunk/src/pdf_page_v1.rs:67` (currently verified: `const VERSION_LABEL: &str = "pdf-page-v1";`).
  - `crates/kebab-app/tests/pdf_pipeline.rs:168` (currently verified: `assert_eq!(pdf_item.chunker_version.as_ref().map(|c| c.0.as_str()), Some("pdf-page-v1"))` is a hard assertion and fails after the v1.1 bump).
  - `crates/kebab-app/tests/pdf_pipeline.rs:368` (currently verified: the error message string literal `"pdf-page-v1 emits 0 chunks for the empty page; total = 2"`; not a hard assertion, but updated so it does not go stale).
- **Action** (spec §4.4.1 decision):
  - **(a) primary const bump** (`crates/kebab-chunk/src/pdf_page_v1.rs:67`):
    ```diff
    -const VERSION_LABEL: &str = "pdf-page-v1";
    +const VERSION_LABEL: &str = "pdf-page-v1.1";
    ```
    The existing test `chunker_version_is_pdf_page_v1` (pdf_page_v1.rs:374) asserts against the `VERSION_LABEL` const, so it updates automatically; no test code change is needed.
  - **(b) update the test assertion literal** (`crates/kebab-app/tests/pdf_pipeline.rs:168`, required):
    ```diff
    -    Some("pdf-page-v1")
    +    Some("pdf-page-v1.1")
    ```
  - **(c) update the test error message literal** (`crates/kebab-app/tests/pdf_pipeline.rs:368`, recommended):
    ```diff
    -    "pdf-page-v1 emits 0 chunks for the empty page; total = 2"
    +    "pdf-page-v1.1 emits 0 chunks for the empty page; total = 2"
    ```
- **Acceptance**:
  - `grep -nE 'const VERSION_LABEL: &str = "pdf-page-v[0-9.]+";' crates/kebab-chunk/src/pdf_page_v1.rs` shows `"pdf-page-v1.1"`.
  - `cargo test -p kebab-chunk chunker_version_is_pdf_page_v1 -j 4` green (passes automatically because it references `VERSION_LABEL`).
  - `grep -rn '"pdf-page-v1"' crates/ --include='*.rs' | grep -v 'pdf-page-v1\.1'` returns **0** lines. The `grep -v` filter guards against false positives from `pdf-page-v1.1` lines (excluded by their `.1` suffix); 0 lines after the filter means no stale literal remains.
  - `cargo test -p kebab-app pdf_pipeline -j 4` green (after the line 168 assertion update).

#### Sub-action B4 — update module doc + add HOTFIXES.md entry

- **Files affected**:
  - `crates/kebab-chunk/src/pdf_page_v1.rs:47-60` (currently verified: the `## chunk_id collision deviation` paragraph of the module doc).
  - `tasks/HOTFIXES.md` (add a new dated entry at the existing entry position: the file's latest entry is `2026-05-26`, so insert the `2026-05-27 — v0.20.0 sub-item 1` entry above it, following the file's chronological pattern).
- **Action**:
  - **(a) module doc**: the updated text from spec §4.4, as-is:
    ```diff
    -//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
    -//! variant `format!("{base_policy_hash}#c{char_start}")` into the
    -//! recipe's `policy_hash` slot (so distinct chunks distinguish via
    -//! different policy_hash inputs), while storing the unmodified
    -//! `base_policy_hash` in `Chunk.policy_hash` so the field still answers
    -//! "what policy was active". Logged in `tasks/HOTFIXES.md`.
    +//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
    +//! variant `format!("{base_policy_hash}#c{segment_start}")` into the
    +//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap
    +//! segment boundary, strictly increasing across the returned chunks
    +//! even when the overlap walk collapses `actual_start` to a previous
    +//! chunk's `prev_min`. Unmodified `base_policy_hash` is stored in
    +//! `Chunk.policy_hash` so the field still answers "what policy was
    +//! active". v1.1 second-iteration patch — logged in
    +//! `tasks/HOTFIXES.md` (2026-05-27).
    ```
  - **(b) HOTFIXES.md entry** (entry body from spec §4.4):
    ```markdown
    ## 2026-05-27 — v0.20.0 sub-item 1: chunk_id `#c{char_start}` workaround collapses under aggressive overlap (Bug #3 second-iteration patch)

    **Symptom**: ingesting F2 (1580 chars OCR, scanned_page2.pdf) fails with
    `DocumentStore::put_chunks (pdf): sqlite error: UNIQUE constraint
    failed: chunks.chunk_id: ... Error code 1555: A PRIMARY KEY constraint
    failed`. Reproducible on every `--force-reingest` in the `kebab v0.20.0`
    (commit `b4d9e60`) dogfood (qwen2.5vl:3b on the `192.168.0.47:11434`
    Ollama endpoint, isolated KB at `/build/cache/tmp/v0.20-dogfood`).

    **Root cause**: in `crates/kebab-chunk/src/pdf_page_v1.rs:170`,
    `per_chunk_hash = format!("{base_policy_hash}#c{char_start}")` uses
    `char_start` = post-overlap `actual_start`. The overlap walk at lines
    266-281 back-walks only as far as the `prev_min` floor. With aggressive
    overlap plus a page whose first segment is small (F2's Korean OCR text:
    a sentence end within the first ~10 chars → segment_1 = [0, 30],
    segment_2 = [30, n], overlap_bytes 240 / chars=80 → segment_2's
    actual_start collapses to prev_min=0), both chunks get an identical
    `#c0` suffix → identical chunk_id → `chunks` PRIMARY KEY violation.

    **Fix** (spec §4.4): add `segment_start` to the `chunk_page` return
    tuple (3-tuple → 4-tuple `(segment_start, actual_start, chunk_end,
    slice)`) and change the suffix of the caller's `per_chunk_hash` to
    `segment_start`. `segment_start` is `bounds[seg_idx]` (strictly
    increasing after dedup), so every chunk is distinct regardless of the
    overlap walk. `SourceSpan::Page.char_start`, used for citation
    locality, still keeps the post-overlap `actual_start`.

    **chunker_version cascade**: `pdf-page-v1` → `pdf-page-v1.1` bump
    (spec §4.4.1 round 1c M-1 decision, a direct application of the design
    §9 cascade rule). chunk_ids change for multi-chunk PDF pages (including
    normal paths such as the pre-OCR `metro-korea.pdf` with 21 blocks /
    34 chunks); this secures an explicit user-facing audit trail and the
    store layer's automatic invalidation report. Because v0.20.0 is on the
    force-update path, the cost to users is zero (a fresh ingest happens
    anyway).

    **Amends**: spec `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`
    §4.4. The parent design §4.2 chunk_id recipe itself is unchanged (only
    the internal computation of the workaround layer changes). Parent PR
    #189 (`feat/pdf-scanned-ocr`, force-update path).
    ```
- **Acceptance**:
  - `grep -c "#c{segment_start}" crates/kebab-chunk/src/pdf_page_v1.rs` ≥ **2** (module doc + the actual call at line 170).
  - `grep -c "2026-05-27 — v0.20.0 sub-item 1: chunk_id" tasks/HOTFIXES.md` = **1**.

#### Sub-action B5 — unit regression test `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids`

- **Files affected**: the `#[cfg(test)] mod tests` in `crates/kebab-chunk/src/pdf_page_v1.rs` (currently verified: the `make_pdf_doc(&[&str])` and `default_policy(target, overlap)` helpers already exist, lines 300-371).
- **Action** (test body from spec §4.5, as-is):
  ```rust
  #[test]
  fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() {
      // Trigger shape of the Korean OCR text: 10 chars of "가" + ". " + 500 chars of "나".
      // → first segment [0, 12), second segment [12, n).
      // page_text byte_len = 10*3 + 2 + 500*3 = 1532 > target_bytes=1500
      // → multi-chunk. overlap_bytes = min(240, 750) = 240, chars=80
      // → the second chunk's actual_start collapses to prev_min=0 → same `#c0`.
      //
      // default_policy(500, 80): target_tokens=500 → target_bytes=500*3=1500
      // (Korean at 3 bytes/char), overlap_tokens=80 → overlap_bytes=min(240, 750)=240.
      // Added per verifier round 1 L-3.
      let early_seg: String = std::iter::repeat('가').take(10).collect();
      let tail: String = std::iter::repeat('나').take(500).collect();
      let page_text = format!("{early_seg}. {tail}");

      let doc = make_pdf_doc(&[&page_text]);
      let policy = default_policy(500, 80); // target=1500 byte, overlap=240 byte
      let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();

      assert!(
          chunks.len() >= 2,
          "expected ≥2 chunks for {} byte page; got {}",
          page_text.len(),
          chunks.len()
      );

      let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
      ids.sort_unstable();
      let total = ids.len();
      ids.dedup();
      assert_eq!(
          ids.len(),
          total,
          "all chunk_ids must be unique even when overlap walks actual_start back to prev_min"
      );
  }
  ```
- **Acceptance**:
  - `cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids -j 4` green.
  - `cargo test -p kebab-chunk deterministic_chunk_ids_1000 -j 4` green (existing determinism invariant preserved).
  - `cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target -j 4` green (existing overlap clamp invariant preserved).
  - `cargo test -p kebab-chunk -j 4` all green.

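The byte arithmetic in the test comment can be checked on its own. The sketch below only reproduces the numbers quoted in the comment; `BYTES_PER_TOKEN`, `target_bytes`, `overlap_bytes`, and `trigger_page_text` are hypothetical names, and the `min(overlap, target / 2)` clamp is an assumption read off the `min(240, 750)` figure rather than taken from the chunker source.

```rust
/// Assumed conversion factor from the test comment (Korean at 3 bytes/char).
const BYTES_PER_TOKEN: usize = 3;

/// target_tokens → target_bytes, as described in the test comment.
fn target_bytes(target_tokens: usize) -> usize {
    target_tokens * BYTES_PER_TOKEN
}

/// overlap_tokens → overlap_bytes, clamped to half the target
/// (assumption inferred from `min(240, 750)`).
fn overlap_bytes(overlap_tokens: usize, target_bytes: usize) -> usize {
    (overlap_tokens * BYTES_PER_TOKEN).min(target_bytes / 2)
}

/// Builds the trigger page: 10 x "가", then ". ", then 500 x "나".
fn trigger_page_text() -> String {
    format!("{}. {}", "가".repeat(10), "나".repeat(500))
}
```

Both Hangul syllables encode to 3 bytes in UTF-8, so the page is 512 chars and 1532 bytes, just over the 1500-byte target that forces a second chunk.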
#### Commit (all of Step 2)

```
fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)

- crates/kebab-chunk/src/pdf_page_v1.rs: chunk_page returns 4-tuple
  (segment_start, actual_start, chunk_end, slice); caller per_chunk_hash
  suffix uses segment_start (pre-overlap boundary, strictly increasing)
  instead of char_start (post-overlap, may collapse to prev_min).
- VERSION_LABEL "pdf-page-v1" → "pdf-page-v1.1" (design §9 cascade,
  explicit user-facing audit trail).
- module docs: workaround description updated to segment_start.
- mod tests: multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids
  regression pin.
- tasks/HOTFIXES.md: 2026-05-27 entry (symptom F2 1580 char OCR,
  intra-doc collision root cause, second-iteration patch rationale).
- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §4
```

---

### Step 3 (Group C): Bug #3 multi-scanned PDF integration test

spec §4.5 + §4.5.1: a chunk_id-collision-free regression test at the `kebab-app` integration level. To avoid a real Ollama dependency, it uses a MockOcrEngine for the `OcrEngine` trait. The existing private MockOcrEngine at `crates/kebab-app/tests/pdf_ocr_apply.rs:20-45` lives in a separate test binary of the same crate and cannot be imported directly, so the executor's first sub-action is to decide the share path.

#### Sub-action C1 — MockOcrEngine share decision (executor's dependency-check task)

- **Files affected** (branches per option):
  - **Option A (share via `tests/common/`)**, corrected per the verifier round 1 H-2 actual probe:
    - `crates/kebab-app/tests/common/mod.rs` **already exists** (172-line `TestEnv` infrastructure: `#![allow(dead_code)]`, `pub struct TestEnv`, `pub fn ingest_md`, `pub fn lexical_query`, etc.). Action = **append the single line `pub mod mock_ocr;`** (do not create a new mod.rs).
    - `crates/kebab-app/tests/common/mock_ocr.rs` (**new** file: MockOcrEngine lift + per-page ctor).
    - Remove the MockOcrEngine struct + impl from the existing `pdf_ocr_apply.rs:20-45`, add the `mod common; use common::mock_ocr::MockOcrEngine;` import, and migrate the ctor call sites (see M-3).
    - The new integration test shares it via `mod common; use common::mock_ocr::MockOcrEngine;`.
  - **Option B (inline duplication)**: an inline `struct LocalMockOcr` + `impl OcrEngine for LocalMockOcr` inside the new test file `multi_scanned_pdf_ingest_no_chunk_id_collision.rs` (prioritizes test isolation; does not touch common/mod.rs).
- **Action**:
  - **(a) dependency probe**: follow the decision path in spec §4.5.1:
    ```bash
    grep -rn "impl OcrEngine" crates/kebab-parse-image/src/ crates/kebab-app/tests/
    # actual result:
    # crates/kebab-parse-image/src/ocr.rs:235 — production OllamaVisionOcr.
    # crates/kebab-app/tests/pdf_ocr_apply.rs:25 — test-only MockOcrEngine.
    ls crates/kebab-app/tests/common/mod.rs
    # actual result: -rw-r--r-- ... 172 line (TestEnv infrastructure already exists).
    ```
  - **(b) executor decision**:
    - The existing MockOcrEngine is constructed as `MockOcrEngine { expected_text: String, fail: bool }`; supporting a different text length per page requires extending the ctor signature (e.g. `expected_text: Vec<String>` + an internal `Mutex<usize>` cursor). The extension is trivial and both tests live in the same crate → **Option A recommended**.
    - With Option A, the MockOcrEngine ctor call sites in `pdf_ocr_apply.rs` (actual verifier probe: **10 instantiation sites** at lines 140, 170, 193, 210, 242, 284, 311, 334, 359, 399, correcting the "9 → 10" off-by-one from critic round 1 L-2; the struct definition at line 21 is excluded) migrate to the new ctor signature. For backward compatibility, provide two ctors (`MockOcrEngine::single(text, fail)` + `MockOcrEngine::per_page(texts, fail)`). **Mechanical migration**: at each site, `MockOcrEngine { expected_text: <text>, fail: <bool> }` → `MockOcrEngine::single(<text>, <bool>)` (10 sites × 1-line edit; the actual cost per verifier round 1 M-3).
    - Option B (inline) is for the case where the cost of sharing exceeds the value of test isolation. This plan's first preference is Option A.
  - **(c) record the decision**: record the decision path under §6 open question 1 in the closing summary of the result file (`.omc/reviews/2026-05-27-v0.20-bugfix-plan-drafter-r1c-result.md`). With Option A, the file edits for sub-action C2 are: (existing) `common/mod.rs` 1-line append + (new) `common/mock_ocr.rs` + (modify) `pdf_ocr_apply.rs` + (new) `multi_scanned_pdf_ingest_no_chunk_id_collision.rs` = 4 files. With Option B, only 1 new file.
- **Acceptance**:
  - The probe grep returns ≥ 2 lines (production + existing mock).
  - The probe ls confirms that `common/mod.rs` exists.
  - The executor's decision is stated in this plan's §6 open question OQ-1.

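One possible shape for the two-ctor mock under Option A is sketched below. It is illustrative only: the `OcrEngine` trait shown here, with a single `recognize` method, is a hypothetical stand-in (the real trait signature in `kebab-parse-image` is not reproduced in this plan), and clamping to the last text once the cursor runs past the vec is a design choice made for the sketch.

```rust
use std::sync::Mutex;

/// Hypothetical, simplified stand-in for the real `OcrEngine` trait.
trait OcrEngine {
    fn recognize(&self, image: &[u8]) -> Result<String, String>;
}

struct MockOcrEngine {
    texts: Vec<String>,
    cursor: Mutex<usize>,
    fail: bool,
}

impl MockOcrEngine {
    /// Backward-compatible ctor: every call returns the same text.
    fn single(text: impl Into<String>, fail: bool) -> Self {
        Self { texts: vec![text.into()], cursor: Mutex::new(0), fail }
    }

    /// Per-page ctor: call N returns texts[N], clamped to the last entry.
    fn per_page(texts: Vec<String>, fail: bool) -> Self {
        Self { texts, cursor: Mutex::new(0), fail }
    }
}

impl OcrEngine for MockOcrEngine {
    fn recognize(&self, _image: &[u8]) -> Result<String, String> {
        if self.fail {
            return Err("mock OCR failure".to_string());
        }
        // Advance the cursor on every successful call.
        let mut cursor = self.cursor.lock().unwrap();
        let idx = (*cursor).min(self.texts.len().saturating_sub(1));
        *cursor += 1;
        self.texts
            .get(idx)
            .cloned()
            .ok_or_else(|| "no mock text configured".to_string())
    }
}
```

Because `single` stores a one-element vec, the 10 existing call sites keep their "same text on every call" behaviour after the mechanical migration, while the new test can hand in one text per page.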
#### Sub-action C2: write the integration test (conditional on the C1 decision)

- **Files affected** (assuming Option A; corrected per verifier round 1 H-2):
  - `crates/kebab-app/tests/common/mod.rs` (**existing**, 172 lines; append the single line `pub mod mock_ocr;` only).
  - `crates/kebab-app/tests/common/mock_ocr.rs` (**new**; MockOcrEngine lifted here, plus the per-page constructor).
  - `crates/kebab-app/tests/pdf_ocr_apply.rs:20-45` (remove the existing inline impl and add `mod common; use common::mock_ocr::MockOcrEngine;`, i.e. one mod declaration line at the file head), plus the mechanical migration of the 10 constructor call sites (M-3).
  - `crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs` (**new**); imports via `mod common; use common::mock_ocr::MockOcrEngine;`.
- **Action** (the test body from spec §4.5, expanded in this sub-action):
  - **fixture**: F1 (`scanned_page1.pdf`, 779-char OCR) + F2 (`scanned_page2.pdf`, 1580-char OCR) + 1 synthetic small-page PDF (300 chars), for 3 scanned PDFs in total.
  - **MockOcrEngine ctor**: per-page text vec `["text for F1", "text for F2, a 1580 char string", "text for synthetic 300 char"]` + `fail: false`.
  - **isolated KB**: `tempfile::tempdir()`; override only `data_dir` on `Config::default()`, and set the workspace `[ingest.pdf].enabled = true`.
  - **assertion path**:
    1. Call `kebab_app::ingest_with_config_opts(&cfg, ...)` (facade).
    2. `report.items.iter().filter(|i| i.kind == IngestItemKind::Error).count() == 0`: no `ErrorKind::Storage` row, which a chunk_id collision would produce.
    3. `store.get_chunks_count() == sum(per-PDF chunk_counts)`: the final row count after the DELETE+INSERT path.
    4. `store.get_all_chunk_ids().iter().collect::<HashSet<_>>().len() == chunks_count`: global chunk_id uniqueness.
  - **executor degradation path** (spec §4.5.1 conditional downgrade): if sharing under Option A is too costly or risky and Option B is also impractical (for example, the ExtractContext / Facade wiring of the integration setup exceeds the scope of this sub-action), conditionally downgrade the acceptance of spec §6 row 7. The `kebab-chunk` unit-level invariant (Step 2 B5) alone then pins the core regression for Bug #3, and the integration test is skipped.
- **Acceptance**:
  - `cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision -j 4` green.
  - `cargo test -p kebab-app pdf_ocr_apply -j 4` green (zero regressions in the existing tests after the 10 literal `MockOcrEngine { expected_text, fail }` constructions migrate to `MockOcrEngine::single(text, fail)`; 10 is the actual count from critic round 1 L-2).
  - On the downgrade path: record one line in the result file and the commit body, "§6 row 7 conditional skip; Bug #3 core regression = kebab-chunk unit B5".

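The uniqueness condition in assertion 4 can be sketched in isolation as below. The helper names `all_unique` and `duplicate_chunk_ids` are hypothetical, and a plain `&[String]` stands in for whatever the store accessor returns. Listing the duplicated ids is an addition meant to make a failing assertion easier to read.

```rust
use std::collections::HashSet;

/// The uniqueness condition: distinct count == total count.
pub fn all_unique(chunk_ids: &[String]) -> bool {
    chunk_ids.iter().collect::<HashSet<_>>().len() == chunk_ids.len()
}

/// Returns each chunk_id that occurs more than once, reported once,
/// in the order the first repeat is encountered.
pub fn duplicate_chunk_ids(chunk_ids: &[String]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut reported: HashSet<&str> = HashSet::new();
    let mut dups = Vec::new();
    for id in chunk_ids {
        // `insert` returns false when the value was already present.
        if !seen.insert(id.as_str()) && reported.insert(id.as_str()) {
            dups.push(id.clone());
        }
    }
    dups
}
```

In a test, asserting `duplicate_chunk_ids(&ids).is_empty()` instead of a bare length comparison would surface the colliding ids in the failure message.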
#### Commit (all of Step 3)

```
test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)

- crates/kebab-app/tests/common/{mod,mock_ocr}.rs: MockOcrEngine lift
  with per-page text ctor (shared by pdf_ocr_apply.rs + new test).
- crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs:
  3 scanned PDF (F1 + F2 + synthetic 300char) ingest via mock OCR,
  assert all chunk_ids globally unique + zero ErrorKind::Storage rows.
- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §4.5
```

If Option B (inline) or the conditional downgrade is chosen, adjust the commit body and file list accordingly.

---

### Step 4 (Group D): Bug #4 F4 fixture re-generation

spec §5: replace the byte-level `re.sub` + hand-edited startxref in `tests/fixtures/_synth/mojibake.py` with proper PDF surgery via pikepdf (open, delete `/ToUnicode`, save with automatic xref regeneration). Regenerate the F4 fixture itself (`crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf`) and add 3 new invariant tests.

#### Sub-action D1: pikepdf rewrite of `tests/fixtures/_synth/mojibake.py`

- **Files affected**: `tests/fixtures/_synth/mojibake.py` (full rewrite; the existing byte-edit pattern is dropped).
- **Action** (following the body of spec §5.4):
  - Step 1: synthesize a Korean PDF with reportlab using a Type 0 (CID) font (with a valid ToUnicode CMap included).
  - Step 2: open the PDF with pikepdf, remove the `/ToUnicode` entry from every dictionary, and save (the xref is regenerated automatically on save). Note that in pikepdf `allow_overwriting_input=True` is an argument of `pikepdf.open(...)`, not of `save()`, and it is only needed when saving back over the file that was opened.
  - Step 3: verify the invariants: `len(pdf.pages) == 1` and `b"/ToUnicode" not in dst.read_bytes()`.
  - On failure, exit with a non-zero code and a stderr message (Step 2 removed count = 0 → exit 2; Step 3 page count mismatch → exit 3; ToUnicode still present → exit 4).
- **Dep install** (executor pre-action):

  ```bash
  pip install --cache-dir /build/cache/pip pikepdf reportlab
  python -c "import pikepdf; import reportlab; print(pikepdf.__version__, reportlab.Version)"
  # font availability probe (critic round 1 L-3): the path hardcoded in mojibake.py.
  test -f /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf \
    || sudo apt-get install -y fonts-dejavu-core
  ```

  This is not reflected in the CI environment: the fixture itself is committed, so generation is a one-off (local to the executor in Step 4 D2). A one-line pikepdf install hint may optionally be added to `tasks/HOTFIXES.md`.
- **Acceptance**:
  - `grep -c "import pikepdf" tests/fixtures/_synth/mojibake.py` = **1**.
  - `grep -c "re.sub" tests/fixtures/_synth/mojibake.py` = **0** (confirms the byte-edit pattern is gone).
  - `test -f /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf` exits 0 (font probe; the early-failure signal from critic round 1 L-3).
  - `python tests/fixtures/_synth/mojibake.py /tmp/mojibake_dryrun.pdf && echo OK` exits 0 with empty stderr.

#### Sub-action D2: regenerate the F4 fixture binary, regenerate the snapshot, commit

- **Files affected** (the 2 consumers enumerated by the actual probe `grep -rn 'fixtures/mojibake.pdf' crates/`, per verifier round 1 H-4 + critic round 1 M-2):
  - `crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf` (regenerate).
  - `crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json` (**the snapshot baseline file itself**; delete + auto-regen). Per the verifier round 1 H-4 probe, `text_extractor_regression.rs:59-64` uses a hand-rolled `unwrap_or_else { write baseline }` pattern.
  - `crates/kebab-parse-pdf/tests/text_extractor_regression.rs` (existing test; no code change, it only triggers the snapshot regen path).
  - `crates/kebab-parse-pdf/src/text_quality.rs:96` (the second consumer from verifier round 1 H-4: the unit test/doctest around `let bytes = include_bytes!("../tests/fixtures/mojibake.pdf");` is re-verified whenever the fixture binary changes; no code change).
- **Action**:
  - **(a) regenerate command**:

    ```bash
    python tests/fixtures/_synth/mojibake.py \
      crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
    ```

  - **(b) manual probe after regeneration**:

    ```bash
    python -c "import pikepdf; pdf = pikepdf.open('crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf'); print(len(pdf.pages))"
    # expected: 1
    grep -c "/ToUnicode" crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
    # expected: 0 (binary grep; no ToUnicode in the Pages dict)
    ```

  - **(c) snapshot baseline regen** (the actual mechanic per verifier round 1 H-4 + critic round 1 M-2; closes OQ-2):
    - `text_extractor_regression.rs:59-64` is a hand-rolled pattern, `let baseline = std::fs::read_to_string("tests/snapshots/vector_pdf_canonical.json").unwrap_or_else(|_| { std::fs::write(baseline_path, &actual).expect(...); actual.clone() })` (the insta crate is not used).
    - The fixture binary changes → on the next cargo test `actual != baseline` → `assert_eq!` fails.
    - Executor regen step:

      ```bash
      rm crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json
      cargo test -p kebab-parse-pdf vector_pdf_extract_byte_identical_to_baseline -j 4
      # 1st run: snapshot file is absent → the unwrap_or_else write pattern creates the new baseline → assert passes.
      cargo test -p kebab-parse-pdf vector_pdf_extract_byte_identical_to_baseline -j 4
      # 2nd run: byte-identical to the new baseline → assert passes (regression invariant established).
      ```

    - OQ-2 closure: the insta crate is not used, so the cargo-insta CLI is not needed. This spells out the actual mechanic behind "update the F4 baseline of the existing `text_extractor_regression.rs`" in spec §5.6.
- **Acceptance**:
  - `stat crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf` reports size > 0.
  - `grep -c "/ToUnicode" crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf` = **0**.
  - The Python page count probe prints `1`.
  - `test -f crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json` exits 0 (after the snapshot regen).
  - `cargo test -p kebab-parse-pdf vector_pdf_extract_byte_identical_to_baseline -j 4` is green twice in a row (regen + verify, the H-4 mechanic).
  - `cargo test -p kebab-parse-pdf -j 4` is fully green; this also verifies the new D3 tests and the second consumer at text_quality.rs:96.

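The write-on-missing baseline pattern that the two-run regen relies on can be sketched as follows. This is an illustration under assumptions, not the code at `text_extractor_regression.rs:59-64`: the function name `matches_or_writes_baseline` is hypothetical, and it returns a bool where the real test asserts.

```rust
use std::fs;
use std::path::Path;

/// Compares `actual` against the baseline stored at `baseline_path`.
/// If the baseline cannot be read (for example because it was deleted),
/// `actual` is written as the new baseline and the comparison trivially
/// succeeds. The parent directory is assumed to exist.
pub fn matches_or_writes_baseline(baseline_path: &Path, actual: &str) -> bool {
    let baseline = fs::read_to_string(baseline_path).unwrap_or_else(|_| {
        fs::write(baseline_path, actual).expect("write new baseline");
        actual.to_string()
    });
    baseline == actual
}
```

One property of this pattern worth keeping in mind: the first run after deleting the file always passes, so it is the second run that actually establishes the regression invariant.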
#### Sub-action D3: 3 new invariant tests in parse-pdf

- **Files affected** (decided from the actual probe in verifier round 1 H-3):
  - Actual `ls crates/kebab-parse-pdf/tests/` output: `common`, `extractor.rs`, `fixtures`, `ocr_e2e.rs`, `page_image.rs`, `snapshots`, `text_extractor_regression.rs`. The round 0 plan's primary candidate, `text_extractor.rs`, **does not exist**.
  - **Decision**: append to `crates/kebab-parse-pdf/tests/text_extractor_regression.rs` (locality with the F4 fixture consumer, and the same file as the D2 snapshot regen mechanic). Append 3 new `#[test] fn`s.
  - Alternative (if the executor decides to split for file size / cohesion reasons): a new `crates/kebab-parse-pdf/tests/mojibake_invariants.rs`. The plan's first preference is appending to `text_extractor_regression.rs`.
- **Action** (path corrections to the 3 test bodies of spec §5.5, per verifier round 1 H-3):
  1. `mojibake_fixture_load_yields_one_page`: `let bytes = include_bytes!("fixtures/mojibake.pdf");` (integration tests already sit at the `crates/kebab-parse-pdf/tests/` root and follow the canonical pattern at `text_extractor_regression.rs:42`; the `"../tests/fixtures/mojibake.pdf"` in spec §5.5 is wrong, use `"fixtures/mojibake.pdf"` directly). Assert `lopdf::Document::load_mem(bytes).unwrap().get_pages().len() == 1`.
  2. `mojibake_fixture_has_no_tounicode_cmap`: avoid the risky CWD-relative `std::fs::read("tests/fixtures/mojibake.pdf")` (under cargo test, CWD may differ from CARGO_MANIFEST_DIR) and use `let bytes = include_bytes!("fixtures/mojibake.pdf");` instead. Assert `bytes.windows(b"/ToUnicode".len()).filter(|w| *w == b"/ToUnicode").count() == 0`.
  3. `pdf_text_extractor_on_mojibake_yields_one_block`: `let bytes = include_bytes!("fixtures/mojibake.pdf");`, then verify PdfTextExtractor's `1 Block::Paragraph per page` invariant: `canonical.blocks.len() == 1`, plus either a `scanned candidate` warning or non-empty text. For the actual ExtractContext setup, the executor uses the existing helper in `text_extractor_regression.rs` (if there is one) or expands the placeholder in spec §5.5.
- **Acceptance**:
  - `cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page -j 4` green.
  - `cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap -j 4` green.
  - `cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block -j 4` green.
  - `cargo test -p kebab-parse-pdf -j 4` fully green.

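The marker scan in test 2 reduces to counting byte windows. A standalone sketch follows, using the hypothetical helper name `count_marker`; the empty-needle guard is an addition, since `slice::windows` panics on a size of 0.

```rust
/// Counts occurrences of `needle` in `haystack`, scanning every byte offset.
pub fn count_marker(haystack: &[u8], needle: &[u8]) -> usize {
    if needle.is_empty() {
        // `windows(0)` panics, so treat an empty needle as "no match".
        return 0;
    }
    haystack
        .windows(needle.len())
        .filter(|w| *w == needle)
        .count()
}
```

A raw byte scan of this kind only sees names that are stored uncompressed in the file, so it complements rather than replaces the dictionary-level removal done in `mojibake.py`.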
#### Commit (all of Step 4)

```
fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)

- tests/fixtures/_synth/mojibake.py: full rewrite; replace byte-level
  re.sub + manual startxref edit with pikepdf open+del+save (auto xref
  regen). Type 0 font + ToUnicode strip via dictionary walk.
- crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf: regenerate.
- crates/kebab-parse-pdf/tests/text_extractor_regression.rs: append 3
  invariant tests (lopdf 1-page / no ToUnicode marker / PdfTextExtractor
  1-block); file path decision from verifier round 1 H-3 (same locality
  with snapshot regen).
- crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json:
  delete + auto-regen via 2-run cargo test (hand-rolled unwrap_or_else
  pattern, verifier round 1 H-4).
- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §5
```

Correction per verifier round 1 NIT-2: commit scope `test-fixture` → `parse-pdf` (the crate name, which is the typical scope in conventional commits).

---

### Step 5 (Group E): Workspace verify + commit + PR #189 force-push

The 16-row consolidated acceptance of spec §6 serves as the verifier checklist for this step. Every acceptance command is scriptable.

#### Sub-action E1: cargo workspace test `-j 1`

- **Files affected**: none (verification only).
- **Action**:
  - **(a) conditional cargo clean**, per memory `feedback_cargo_clean_policy` (verifier round 1 NIT-1 fix: force GB units with `-BG` to avoid TB/GB unit confusion):

    ```bash
    # Read /build avail directly as an integer in GB (strip only the 'G' suffix of the df -BG output).
    AVAIL_GB=$(df -BG --output=avail /build | tail -1 | tr -d ' G')
    # Size of CARGO_TARGET_DIR, also as an integer in GB (du -BG output).
    TARGET_GB=$(du -BG -s "${CARGO_TARGET_DIR:-target}" 2>/dev/null | awk '{print $1}' | tr -d 'G')
    # /build avail < 500 GB OR target > 200 GB → clean
    if [[ "${AVAIL_GB:-9999}" -lt 500 ]] || [[ "${TARGET_GB:-0}" -gt 200 ]]; then
      cargo clean
    fi
    ```

    If neither threshold is hit, skip the clean and record one line in the commit body / result file (for example "skipped cargo clean — /build avail ${AVAIL_GB}G, target ${TARGET_GB}G").
  - **(b) workspace test**:

    ```bash
    # Without pipefail the pipeline reports tail's exit status, not cargo's.
    set -o pipefail
    cargo test --workspace --no-fail-fast -j 1 2>&1 | tail -100
    ```

    - Check the last 100 lines and the final summary "test result: ok. N passed; 0 failed".
- **Acceptance**:
  - Exit code 0.
  - stdout contains "test result: ok" and "0 failed".
  - Satisfies spec §6 row 14 (workspace full test pass).

#### Sub-action E2: `cargo clippy --workspace -- -D warnings`

- **Files affected**: none.
- **Action**:

  ```bash
  # Without pipefail the pipeline reports tail's exit status, not cargo's.
  set -o pipefail
  cargo clippy --workspace --all-targets -j 1 -- -D warnings 2>&1 | tail -50
  ```

- **Acceptance**:
  - Exit code 0.
  - Zero occurrences of the keyword "warning" (`-D warnings` turns any warning into an error anyway).
  - Satisfies spec §6 row 13 (workspace clippy clean).

#### Sub-action E3: dogfood re-run (Ollama qwen2.5vl:3b environment)

- **Files affected**: none. `/build/cache/tmp/v0.20-dogfood/` (isolated KB; the same dogfood directory is reused).
- **Action** (following memory `feedback_pr_workflow` + the `_external/` invariant):
  - **(a) release build**:

    ```bash
    cargo build --release -p kebab-cli -j 4 2>&1 | tail -10
    "${CARGO_TARGET_DIR:-target}/release/kebab" --version
    # expected: kebab 0.20.0
    ```

  - **(b) clean the dogfood KB + re-ingest 9 PDFs** (same dogfood environment as spec §1.1).

    The canonical config path is `/build/cache/tmp/v0.20-dogfood/config.toml` (§0 assumption). The external backup file `/build/cache/tmp/v0.20-dogfood-config.toml` **does not exist**, per the actual probe in critic round 1 H-1. The procedure therefore **backs up the config itself, then cleans and restores** (this keeps the destructive `rm -rf` from deleting the config along with the KB):

    ```bash
    # Step A: temporary backup of the config (preserved before the KB clean).
    cp /build/cache/tmp/v0.20-dogfood/config.toml \
      /build/cache/tmp/v0.20-dogfood-config.toml.bak

    # Step B: clean the whole KB (destructive, config included; the backup preserves it).
    rm -rf /build/cache/tmp/v0.20-dogfood/
    mkdir -p /build/cache/tmp/v0.20-dogfood/

    # Step C: restore the config from the backup.
    cp /build/cache/tmp/v0.20-dogfood-config.toml.bak \
      /build/cache/tmp/v0.20-dogfood/config.toml

    # config.toml sets [ingest.pdf].enabled = true, ollama endpoint =
    # http://192.168.0.47:11434, ocr_model = qwen2.5vl:3b

    # Step D: ingest.
    "$RELEASE_BIN" ingest --config /build/cache/tmp/v0.20-dogfood/config.toml \
      --json --force-reingest 2>&1 | tee /build/cache/tmp/v0.20-dogfood-ingest.ndjson

    # Step E (optional): clean up the backup file so that redundant backups
    # do not pile up across dogfood iterations. The in-place
    # v0.20-dogfood/config.toml stays canonical; the .bak is transient.
    rm /build/cache/tmp/v0.20-dogfood-config.toml.bak
    ```

    (Alternative: a single-step selective delete that keeps the config and destroys everything else):

    ```bash
    find /build/cache/tmp/v0.20-dogfood/ -mindepth 1 -not -name 'config.toml' \
      -exec rm -rf {} +
    ```

    The actual procedure favors the clarity of the plain 5-step path (the executor default).
  - **(c) acceptance**:
    - spec §6 row 3: across the 9 PDFs, `skipped_size_exceeded == 0` for non-code (that is, all 0, because the workspace contains no code files).
    - spec §6 row 8: F1 + F2 have `kind != "Error"` (no chunk_id collision).
    - spec §6 row 12: the ingest item of mojibake.pdf has `block_count: 1`.
    - spec §6 row 15: all 9 PDFs ingested, `errors = 2` (encrypted only; identical to the pre-existing dogfood baseline).
  - **Fallback when Ollama is unavailable**: if the endpoint is unreachable, this sub-action may be partially skipped. The unit/integration-level evidence from the workspace test (E1) + clippy (E2) then covers spec §6 rows 1, 2, 4-7, 9-11, 13-14, 16, and the skip of dogfood rows 3, 8, 12, 15 is recorded in one line (commit body + result file).
- **Acceptance**:
  - The ndjson ingest report shows errors = 2 (encrypted only).
  - For each of F1/F2/mojibake, the `kind` field of the item line is on the success path (`"new"` or `"unchanged"`, not `"Error"`).
  - Dogfood log path: `/build/cache/tmp/v0.20-dogfood-ingest.ndjson` (referenced in the commit body).

#### Sub-action E4: commit check + final organize

- **Files affected**: the accumulated changes of all steps.
- **Action**:
  - `git status` + `git log --oneline b4d9e60..HEAD`: expect the 4 commits of Steps 1-4 and no commit from Step 5 (verification only).
  - If any work-in-progress files remain, reset them.
  - Check the `Co-Authored-By:` line of the commit messages (CLAUDE.md gitea-pr workflow).
- **Acceptance**:
  - `git log --oneline b4d9e60..HEAD | wc -l` = **4** (one commit for each of Steps 1-4).
  - `git status` shows 0 untracked + modified files.

#### Sub-action E5: PR #189 force-push

- **Files affected**: remote ref `gitea/feat/pdf-scanned-ocr`.
- **Action** (can be invoked directly through the gitea-ops skill):

  ```bash
  git push gitea feat/pdf-scanned-ocr --force-with-lease
  ```

  - `--force-with-lease`: force-pushes only when the remote HEAD matches the local fetch state (protects pushes by other collaborators; a cheap safety net in this single-user environment).
  - Update the PR #189 body: add the Bug #2/#3/#4 fix summary + dogfood evidence (gitea API `PATCH /repos/altair823-org/kebab/pulls/189`).
  - Per user memory `feedback_pr_workflow`, enter the review loop of the `gitea-pr-review` skill (multi-round critic + verifier).
- **Acceptance**:
  - The head SHA reported by the `gh` equivalent (gitea-ops `gitea-pr-status 189`) equals local `git rev-parse HEAD`.
  - PR #189 commit count = the commit count at the previous force-push + 4.
  - The final state matches the 5-row table of the sequencing summary (§7): 4 commits + 1 verify-only step.

#### Commit (Step 5)

Verification only: 0 git commits. The 4 commits of Steps 1-4 are the final tree.

---

## §4 Verifier checklist (spec §6 16-row 1:1 mapping)

Each row is a scriptable command. All of them are covered by the cumulative run of Step 5 E1-E3.

| # | Verifier | Bug | step | Command |
|---|---------|-----|------|------|
| 1 | walker bypasses size cap for PDF | #2 | A3 | `cargo test -p kebab-source-fs size_cap_skips_only_code_files -j 4` |
| 2 | walker still skips oversized code files | #2 | A3 | `cargo test -p kebab-source-fs ingest_report_counts_oversized_files_by_bytes -j 4` |
| 3 | 256KB+ PDF/markdown ingest default config | #2 | E3 | dogfood: the ingest report of `$RELEASE_BIN ingest ...` shows `skipped_size_exceeded = 0` for non-code |
| 4 | chunker collision regression test | #3 | B5 | `cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids -j 4` |
| 5 | chunker determinism preserved | #3 | B5 | `cargo test -p kebab-chunk deterministic_chunk_ids_1000 -j 4` |
| 6 | chunker overlap clamp preserved | #3 | B5 | `cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target -j 4` |
| 7 | integration: multi-scanned PDF ingest (conditional, §4.5.1) | #3 | C2 | `cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision -j 4` (skip + record if the Option A/B downgrade path is taken) |
| 8 | dogfood: F1 + F2 force-reingest errors=0 | #3 | E3 | dogfood: `$RELEASE_BIN ingest --force-reingest ...` reports errors = 0 (excluding encrypted) |
| 9 | F4 fixture lopdf 1-page invariant | #4 | D3 | `cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page -j 4` |
| 10 | F4 fixture no-ToUnicode invariant | #4 | D3 | `cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap -j 4` |
| 11 | F4 PdfTextExtractor 1-block invariant | #4 | D3 | `cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block -j 4` |
| 12 | dogfood: F4 ingest block_count=1 | #4 | E3 | dogfood: the ingest item of mojibake.pdf has `block_count: 1` |
| 13 | workspace clippy clean | all | E2 | `cargo clippy --workspace --all-targets -j 1 -- -D warnings` |
| 14 | workspace full test pass | all | E1 | `cargo test --workspace --no-fail-fast -j 1` |
| 15 | dogfood end-to-end 9 PDF | all | E3 | dogfood: all 9 PDFs ingested, errors = 2 (encrypted only) |
| 16 | chunker_version cascade final value | #3 | B3 | `grep -nE 'pdf-page-v[0-9.]+' crates/kebab-chunk/src/pdf_page_v1.rs` returns `"pdf-page-v1.1"` |

In the final step (E1-E3) the executor runs all 16 rows as scripts and records pass/fail/skip row by row in the result file.

#### Workspace baseline expected test count delta (verifier round 1 M-1 closure)

Delta arithmetic for the expected `test result: ok. N passed` of `cargo test --workspace -j 1` (Step 5 E1), relative to the pre-fix baseline:

| Step | Sub-action | new test name | crate | type |
|---|---|---|---|---|
| 1 | A3 | `size_cap_skips_only_code_files` | kebab-source-fs | unit (in `mod tests`) |
| 2 | B5 | `multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids` | kebab-chunk | unit (in `mod tests`) |
| 3 | C2 | `multi_scanned_pdf_ingest_no_chunk_id_collision` | kebab-app | integration (new test binary) |
| 4 | D3 | `mojibake_fixture_load_yields_one_page` | kebab-parse-pdf | unit-style integration |
| 4 | D3 | `mojibake_fixture_has_no_tounicode_cmap` | kebab-parse-pdf | unit-style integration |
| 4 | D3 | `pdf_text_extractor_on_mojibake_yields_one_block` | kebab-parse-pdf | unit-style integration |

- **Option A (full path, C2 active)**: total = **+6 unit/integration test cases + 1 new integration test binary**. Option B (the inline mock) gives the same count, since it still adds the C2 test file.
- **Conditional downgrade (C2 skipped per §4.5.1)**: total = **+5 test cases + 0 new binaries**.

Other: the assertion content of the existing B3 test `chunker_version_is_pdf_page_v1` is unchanged (it references the VERSION_LABEL const) → test count delta 0. For the existing D2 test `vector_pdf_extract_byte_identical_to_baseline`, only the compared data changes (fixture change → baseline change) → test count delta 0; only the snapshot regen action is added.

When comparing N in the E1 acceptance, the executor confirms that it matches this delta arithmetic (a mismatch signals a regression).

---

## §5 Risks (plan stage)

- **R-1 (MockOcrEngine sharing complexity)**: Option A of spec §4.5.1 (`tests/common/mock_ocr.rs` lift) requires migrating the constructor at the 10 instantiation sites in `pdf_ocr_apply.rs` (the inline impl sits at lines 20-45). This is trivial once the 2 backward-compatible constructors (single + per_page) are introduced; on failure, downgrade to Option B (inline). A conditional skip of spec §6 row 7 is possible.
- **R-2 (chunker_version bump cascade scope)**: the impact of `pdf-page-v1.1` is a chunk_id change for multi-chunk PDF pages. parser_version / embedding_version / prompt_template_version / index_version are unchanged; in the 5-version snapshot of `kebab-eval::eval_runs.config_snapshot_json`, only the chunker_version field takes a new value. The cascade rule invariant of parent design §9 is preserved, and re-running the eval baseline is recommended (the user-facing note of spec §7.1 Risk 1).
- **R-3 (F4 fixture binary churn)**: pikepdf's save output has a different PDF object ordering than reportlab + byte-edit → SHA256 change + git binary diff noise. The `text_extractor_regression.rs` baseline is updated to the new fixture's actual output in the same commit, handled together in Step 4 D2.
- **R-4 (dogfood depends on Ollama)**: the dogfood acceptance of spec §6 rows 3 + 8 + 12 + 15 calls the real `192.168.0.47:11434` qwen2.5vl:3b. If the endpoint is unavailable, close partially on unit/integration evidence (rows 1-2, 4-7, 9-11, 13-14, 16) and record the skip in the commit body / result file.
- **R-5 (pikepdf dependency install)**: `import pikepdf` in the Step 4 D1 mojibake.py requires a pip install into this machine's Python venv. No CI dependency arises (one-off generation, after which the fixture is committed).
- **R-6 (confirming zero conflict with the parent plan running in parallel)**: Step 11 (final verify + PR open) of the parent plan (`2026-05-27-pdf-scanned-ocr-plan.md`, round 1c ACCEPT) is already closed by commit `b4d9e60`. The fix commits of this plan stack on top of that commit, so there is no branch ordering conflict.

---

## §6 Open questions deferred to executor

- **OQ-1 (MockOcrEngine sharing path)**: choose between Option A (`tests/common/mock_ocr.rs` lift) and Option B (inline) of spec §4.5.1. This is the executor's first action in Step 3 C1: probe with `grep -rn "impl OcrEngine"`, decide, and record the decision in the result file. The plan's first preference is Option A.
- **OQ-2 (F4 baseline snapshot update tool)**: ✅ **CLOSED (round 1c, critic M-2 + verifier H-4)**. The actual pattern at `text_extractor_regression.rs:59-64` is a hand-rolled `unwrap_or_else { write baseline }` (the insta crate is not used). Regen procedure: delete the snapshot file `tests/snapshots/vector_pdf_canonical.json`, then run cargo test twice (1st run auto-regenerates, 2nd run verifies). The cargo-insta CLI is not needed. Details are in Action (c) of §3 Step 4 D2.
- **OQ-3 (pikepdf install command)**: the cache-dir and venv choice for `pip install`: global `--user` pip, a venv dedicated to fixture generation, or a conda environment. The plan default is `pip install --cache-dir /build/cache/pip pikepdf reportlab` (memory `feedback_disk_layout`).
- **OQ-4 (when the endpoint in the dogfood config.toml changes)**: if the `192.168.0.47:11434` Ollama endpoint of this dogfood environment changes, the executor overrides it with an alternative endpoint (such as `localhost:11434`) and records this in the result file.
- **OQ-5 (number of rounds in the PR #189 review loop)**: this is the gitea-pr + review loop of memory `feedback_pr_workflow`; depending on the outcome of the round 1 critic + verifier, rounds 2/2c may follow. This plan is at round 0 (drafter) with respect to that loop, and the outcome of the review rounds is outside the scope of the plan.

---

## §7 Sequencing summary (logical commit boundaries)

| commit # | step range | logical scope | file count |
|---:|---|---|---:|
| 1 | Step 1 (A1+A2+A3) | `fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2)` | 2 |
| 2 | Step 2 (B1+B2+B3+B4+B5) | `fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)` | 4 (pdf_page_v1.rs + HOTFIXES.md + pdf_pipeline.rs:168 + :368, verifier H-1) |
| 3 | Step 3 (C1+C2) | `test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)` | **4 (Option A: existing common/mod.rs append + new common/mock_ocr.rs + modify pdf_ocr_apply.rs + new multi_scanned_pdf_ingest_no_chunk_id_collision.rs, verifier H-2)** / 1 (Option B) |
| 4 | Step 4 (D1+D2+D3) | `fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)` | **5 (mojibake.py + fixtures/mojibake.pdf + snapshots/vector_pdf_canonical.json + text_extractor_regression.rs (D3 append) + src/text_quality.rs:96 consumer verify, verifier H-4 + H-3 + NIT-2)** |
| 5 | Step 5 (E1-E5) | verification only: 0 git commits; final state = PR #189 force-push on top of commits 1-4 | 0 |

In total: 4 commits + 1 verify-only step. After the force-push, the head of PR #189 equals local HEAD.

---

## §8 Round 1c rewrite changelog (drafter trace)

All 21 findings from the round 1 critic + verifier (critic 7 + verifier 14) are applied. For details, see the §1 traceability matrix of the result file (`.omc/reviews/2026-05-27-v0.20-bugfix-plan-drafter-r1c-result.md`). This §8 summarizes the substantive changes to the plan body.

### Critic r1 (7 findings)

| ID | Severity | Action | Plan section |
|---|---|---|---|
| critic H-1 | HIGH | E3 dogfood config: 5-step procedure of backup, clean, restore (reflecting the fact that no external backup file exists) | §3 Step 5 E3 (b) |
| critic M-1 | MEDIUM | line 15 "17 sub-action" → "18 sub-action" | §0 prelude line |
| critic M-2 | MEDIUM | D2: spell out the snapshot baseline update mechanic (hand-rolled `unwrap_or_else` pattern, OQ-2 closure) | §3 Step 4 D2 + §6 OQ-2 |
| critic L-1 | LOW | B1 line range "200-289" → "200-204 (doc) + 205-289 (body)" made explicit | §3 Step 2 B1 |
| critic L-2 | LOW | MockOcrEngine ctor count "9 test (existing)" → "10 instantiation site" (actual probe) | §3 Step 3 C1 + C2 |
| critic L-3 | LOW | D1 pre-action: add a one-line DejaVuSans.ttf existence probe | §3 Step 4 D1 |
| critic NIT-1 | NIT | "5 logical commit" → "4 commit + 1 verify-only step (= 5 step total, 4 commit boundary)" | §0 prelude line |

### Verifier r1 (14 findings)

| ID | Severity | Action | Plan section |
|---|---|---|---|
| verifier H-1 | HIGH | B3 sub-action: state the literal updates at `pdf_pipeline.rs:168` (hard assertion) + `:368` (error message), and tighten the acceptance grep regex (`grep -v 'pdf-page-v1\.1'`) | §3 Step 2 B3 |
| verifier H-2 | HIGH | Step 3 Option A: reflect that `common/mod.rs` is existing infrastructure; 1-line `pub mod mock_ocr;` append + new `common/mock_ocr.rs` + `pdf_ocr_apply.rs` lift + new integration test = 4 file edits | §3 Step 3 C1 + C2 + §7 commit 3 file count |
| verifier H-3 | HIGH | D3: the file path `text_extractor.rs` does not exist; corrected to appending to `text_extractor_regression.rs` (locality with the D2 snapshot regen). The `include_bytes!` path also changes from `../tests/fixtures/...` to `fixtures/...` directly, and the CWD-relative `std::fs::read` is avoided | §3 Step 4 D3 |
| verifier H-4 | HIGH | D2 snapshot regen mechanic: delete the snapshot file `tests/snapshots/vector_pdf_canonical.json` + run cargo test twice (1st auto-regen, 2nd verify) + enumerate the second consumer at `src/text_quality.rs:96` | §3 Step 4 D2 + §7 commit 4 file count |
| verifier M-1 | MEDIUM | Add an expected workspace test count delta table after the §4 verifier checklist (+6 cases + 1 new integration binary for Option A / +5 + 0 for the C2 downgrade) | §4 (sub-section) |
| verifier M-2 | MEDIUM | Update the B2 acceptance phrasing: state "Step 2 commit time" and spell out the grep timing per sub-action | §3 Step 2 B2 acceptance |
| verifier M-3 | MEDIUM | C2 Option A: state the command for the "mechanical migration of the 10 existing ctor sites" | §3 Step 3 C1 (b) |
| verifier L-1 | LOW | pdf_page_v1.rs line range 200-289 → 205-289 (same edit pass as critic L-1) | §3 Step 2 B1 |
| verifier L-2 | LOW | caller line range 155-185 → 149-186 | §3 Step 2 B2 |
| verifier L-3 | LOW | B5 test scenario comment: add the arithmetic for target=1500 byte + overlap=240 byte | §3 Step 2 B5 |
| verifier L-4 | LOW | Confirmed that `ingest_pdf_ocr_smoke.rs` is safe with respect to the B3 grep scope (no separate action; the finding itself is no-action) | (verified safe) |
| verifier NIT-1 | NIT | E1: tighten the unit handling of `df -h` by forcing GB units with `df -BG --output=avail` | §3 Step 5 E1 (a) |
| verifier NIT-2 | NIT | Step 4 commit scope `test-fixture` → `parse-pdf` (crate name) | §3 Step 4 commit |
| verifier NIT-3 | NIT | Single definition of the canonical dogfood config path (in §0), referenced by every acceptance command | §0 pre-flight + §3 Step 5 E3 |

### Summary

- frontmatter `status`: `draft (round 0)` → `draft (round 1c)`.
- frontmatter `review_history`: added 3 lines for round 1 critic + verifier + round 1c rewrite.
- Corrected 2 tokens in the prelude statement at plan body line 15 (sub-action count + commit boundary wording).
- Added 1 bullet on the assumed dogfood KB layout to the §0 pre-flight.
- Expanded the detail of the sub-action bodies in the 5 steps of §3 (file paths / acceptance greps / mechanics / migration cost).
- Added the expected test count delta sub-section to the §4 verifier checklist.
- Marked §6 OQ-2 as closed (✅ CLOSED).
- Updated the file counts in the §7 sequencing summary (commit 2: 2→4, commit 3: 3-4→4, commit 4: 3-4→5).
- Populated the §8 round 1c rewrite changelog (this section).

Total plan body line change = roughly +250 net added lines (round 0: 698 lines → round 1c: about 950 lines).

388
docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix2-plan.md
Normal file
@@ -0,0 +1,388 @@
---
title: "v0.20.0 sub-item 1 bugfix round 2 — plan"
created: 2026-05-27
status: "DRAFT round 0"
spec_path: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
spec_status: ACCEPT (round 1c, 308 line)
critic_round_1: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md
critic_round_2: .omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r2-result.md
step_count: 4 (3 commit + 1 sanity-only)
commit_count: 3
branch: feat/pdf-scanned-ocr
head_at_draft: e674ff4
---

# v0.20.0 sub-item 1 bugfix round 2 — plan

## §0 Overview

Implementation map for the ACCEPTed spec (round 1c, all 9 of 9 critic findings applied). The fix scope covers 2 bugs:

- **Bug #6** (Critical): the `?Identity-H Unimplemented?` mojibake marker bypasses the OCR fallback. Fix: marker strip plus dominance heuristic in `crates/kebab-parse-pdf/src/text_quality.rs::compute_valid_char_ratio()`.
- **Bug #7** (Minor doc): `code` is missing from the `--media` value list in `kebab search --help`. Fix: update the clap doc-comment and SKILL.md in sync, plus a CLI help regression test.

5 files are touched in total: 3 edited files, 1 new test file, and the HOTFIXES entry. Branch = `feat/pdf-scanned-ocr` (HEAD = `e674ff4`; the 3 commits of round 2 are appended on top of the 4 commits stacked in round 1). env = `CARGO_TARGET_DIR=/build/out/cargo-target/target`. Fresh release binary = `/build/out/cargo-target/target/release/kebab`.

**Non-scope (spec §2.2 + §5 OQ-4 of this plan)**: beyond the surfaces covered by the ACCEPTed spec, `crates/kebab-mcp/src/tools/search.rs:44`, `crates/kebab-core/src/search.rs:32+52`, `crates/kebab-app/src/ingest_progress.rs:69`, and `crates/kebab-cli/tests/wire_schema_breakdowns.rs:35` hold the same stale value list (`markdown, pdf, image, audio, other`, with `code` missing). They fall outside the spec's frozen grep boundary (`integrations/` + `crates/kebab-cli/src` + `docs/wire-schema/v1`), so they are not commit targets in this round; a follow-up issue is recommended.

---

## §1 Step table

| Step | Title | Scope summary | Commit subject | Files touched |
|------|-------|----------------|----------------|----------------|
| 1 | Bug #6 implementation | `MOJIBAKE_MARKERS` const + `compute_valid_char_ratio()` rewrite + 2 new unit tests + HOTFIXES entry | `fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6)` | `crates/kebab-parse-pdf/src/text_quality.rs`, `tasks/HOTFIXES.md` |
| 2 | Bug #7 doc-comment + SKILL.md | Add `code` to the `--media` value list in the clap doc-comment + sync SKILL.md line 57 | `docs(cli): list 'code' in --media help string + SKILL.md (Bug #7)` | `crates/kebab-cli/src/main.rs`, `integrations/claude-code/kebab/SKILL.md` |
| 3 | Bug #7 CLI help assertion | New test file `crates/kebab-cli/tests/cli_help_smoke.rs` with the `search_help_lists_code_in_media_values` test | `test(cli): assert 'code' in search --help output (Bug #7 regression pin)` | `crates/kebab-cli/tests/cli_help_smoke.rs` (new) |
| 4 | Final sanity (no commit) | workspace test + workspace clippy + optional dogfood retest | (none) | none |

---

## §2 Per-step detail

### Step 1 — Bug #6 implementation

#### §2.1 Files affected

- `crates/kebab-parse-pdf/src/text_quality.rs` (currently 103 lines: lines 1-37 body, lines 39-103 tests).
- `tasks/HOTFIXES.md` (append a dated entry).

#### §2.2 Action

**§2.2.1**: **rewrite** `text_quality.rs` lines 1-18 (file header comment + `compute_valid_char_ratio` body) per the diff in spec §4.1. Additions:

- New const `MOJIBAKE_MARKERS: &[&str] = &["?Identity-H Unimplemented?"]` (at lines 8-12, including a comment that traces the marker to the lopdf 0.32.0 source).
- A 4-stage `compute_valid_char_ratio()` body: marker strip → return 0.0 if the remainder is empty after trim → cap at 0.3 when markers dominate → the existing ratio calculation.
- `is_valid_text_char()` (lines 20-37): **unchanged** (signature and range list preserved).
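
The four stages listed above can be sketched as follows. This is a minimal illustration and not the diff from spec §4.1: the `f64` return type, the `DOMINANCE_CAP` name, the char-count comparison used for "markers dominate", and the abbreviated range list in `is_valid_text_char()` are all assumptions made for the sketch.

```rust
/// Marker strings emitted in place of undecodable text
/// (the single entry is the one named in this plan).
const MOJIBAKE_MARKERS: &[&str] = &["?Identity-H Unimplemented?"];

/// Hypothetical name for the 0.3 cap, which sits below the 0.5 OCR threshold.
const DOMINANCE_CAP: f64 = 0.3;

/// Abbreviated stand-in for the real range list (assumption).
fn is_valid_text_char(c: char) -> bool {
    matches!(
        c as u32,
        0x0020..=0x007E      // ASCII printable
        | 0x1100..=0x11FF    // Hangul Jamo
        | 0xAC00..=0xD7A3    // Hangul syllables
        | 0x4E00..=0x9FFF    // CJK unified ideographs
    )
}

fn compute_valid_char_ratio(text: &str) -> f64 {
    // Stage 1: strip every known marker and record how many chars were removed.
    let total_chars = text.chars().count();
    let mut stripped = text.to_string();
    for marker in MOJIBAKE_MARKERS {
        stripped = stripped.replace(marker, "");
    }
    let remaining_chars = stripped.chars().count();
    let removed_chars = total_chars - remaining_chars;

    // Stage 2: nothing but markers and whitespace means no usable text.
    if stripped.trim().is_empty() {
        return 0.0;
    }

    // Stage 4 (computed first so stage 3 can cap it): ratio over the remainder.
    let valid = stripped.chars().filter(|c| is_valid_text_char(*c)).count();
    let ratio = valid as f64 / remaining_chars as f64;

    // Stage 3: when more text was stripped than remains, cap the ratio.
    if removed_chars > remaining_chars {
        ratio.min(DOMINANCE_CAP)
    } else {
        ratio
    }
}
```

Under these assumptions, a page consisting only of markers returns 0.0 at stage 2, while a short header followed by many markers is capped at 0.3 at stage 3. Both values fall below the 0.5 threshold, so either case would reach the OCR fallback.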

**§2.2.2**: **append** 2 new tests to the `text_quality.rs::tests` module (lines 39-103):

```rust
#[test]
fn identity_h_marker_dominance_caps_ratio_below_threshold() {
    // Short real header followed by 20 markers: the markers dominate the page.
    let s = format!("Page 1 of 5 {}", "?Identity-H Unimplemented?".repeat(20));
    let r = compute_valid_char_ratio(&s);
    assert!(r <= 0.3, "marker-dominant mixed page → ratio ≤ 0.3 (OCR fallback); got {r}");
}

#[test]
fn identity_h_marker_minority_with_long_valid_text_keeps_high_ratio() {
    // 200 valid chars followed by a single marker: the marker is a minority.
    let header = "x".repeat(200);
    let s = format!("{header} ?Identity-H Unimplemented?");
    let r = compute_valid_char_ratio(&s);
    assert!(r > 0.9, "marker-minority page keeps high ratio; got {r}");
}
```

**Important: spec §4.2 wording correction (critic r2 NEW-1)**: the phrase "Replace existing Bug #6 test set with two new tests" in spec §4.2 is stale wording. The current `text_quality.rs::tests` holds 8 tests, **0 of which relate to the Identity-H marker**. The net change is therefore **+2 / -0**. The instruction in brief §2.1 to remove the existing test `identity_h_marker_mixed_with_some_real_text_low_ratio` is stale in the same way: that test does not exist. The executor must **preserve** all 8 existing tests (`empty_string_zero`, `pure_ascii_one`, `pure_hangul_syllables_one`, `pure_pua_zero`, `mixed_half`, `cjk_ideograph_valid`, `hangul_jamo_valid`, `f4_fixture_ratio_under_threshold`).

**§2.2.3**: add the following entry to `tasks/HOTFIXES.md`, above the latest dated section:

```markdown
## 2026-05-27 — Identity-H mojibake marker bypassed OCR fallback (Bug #6)

- **Symptom**: ingest of `metro-korea.pdf` (Identity-H CID font without ToUnicode CMap) finished with `pdf_ocr_pages=0`. The entire extracted text was the `?Identity-H Unimplemented?` marker repeated 1154 times (emitted by lopdf 0.32.0). The text-detect ratio came out as 1.0, which bypassed the OCR fallback threshold of 0.5.
- **Root cause**: `is_valid_text_char()`, used by `crates/kebab-parse-pdf/src/text_quality.rs::compute_valid_char_ratio()`, treats the ASCII printable range (0x0020..=0x007E) as unconditionally valid, so the marker (26 ASCII chars) counted as valid text.
- **Fix**: introduce the `MOJIBAKE_MARKERS` const, strip the markers, return 0.0 when the remainder is empty after trim, and apply a dominance heuristic (cap at 0.3 when the stripped portion is larger than the remainder). ACCEPTed spec: `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md` §4.1. No impact on parser_version or the wire schema.
- **User action**: if a mojibake-heavy PDF of the `metro-korea.pdf` class was already indexed with a v0.20.0 pre-bugfix2 binary, run `kebab ingest --force-reingest <workspace>` to invalidate the cached skip (the release notes carry the same guidance).
```

#### §2.3 Acceptance

Actionable verify commands (per step):

```bash
# A) text_quality: 2 new tests + 8 existing = 10, all green
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo test -p kebab-parse-pdf text_quality -j 4 2>&1 | tail -10

# B) parse-pdf crate clean compile
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build -p kebab-parse-pdf -j 4 2>&1 | tail -3

# C) parse-pdf clippy clean (-D warnings)
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo clippy -p kebab-parse-pdf --all-targets -j 4 -- -D warnings 2>&1 | tail -5
```

Expected: the tail of A = `test result: ok. 10 passed; 0 failed`, B = `Finished`, C = 0 warnings.

#### §2.4 Commit

```bash
git add crates/kebab-parse-pdf/src/text_quality.rs tasks/HOTFIXES.md
git commit -m "$(cat <<'EOF'
fix(parse-pdf): strip Identity-H Unimplemented marker + dominance heuristic in compute_valid_char_ratio (Bug #6)

Why: ingest of metro-korea.pdf (Identity-H CID font without ToUnicode
CMap) wrongly finished with pdf_ocr_pages=0. The 26 ASCII chars of the
`?Identity-H Unimplemented?` marker emitted by lopdf 0.32.0 pass the
0x0020..=0x007E range in is_valid_text_char() → ratio=1.0 → the 0.5
OCR fallback threshold is bypassed.

Change: MOJIBAKE_MARKERS const + 4-stage compute_valid_char_ratio()
(strip → trim-empty zero → dominance cap-0.3 → existing ratio). The
marker list is extensible. is_valid_text_char() itself is unchanged.

Tests: +2 unit (dominance + minority) on top of the existing 8. No
parser_version / wire schema change.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.1 / §4.2 / §6 R-1.
EOF
)"
```

---

### Step 2 — Bug #7 doc-comment + SKILL.md

#### §2.5 Files affected

- `crates/kebab-cli/src/main.rs` lines 158-160 (measured: the 3-line clap doc-comment on SearchArgs `media`).
- `integrations/claude-code/kebab/SKILL.md` line 57.

#### §2.6 Action

**§2.6.1**: edit the doc-comment at `crates/kebab-cli/src/main.rs` lines 158-160:

```diff
     /// p9-fb-36: filter by `assets.media_type` kind. Comma-separated.
-    /// Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`,
-    /// `image`, `audio`, `other`. Unknown values match nothing.
+    /// Aliases: `md` → `markdown`. Other accepted: `markdown`, `pdf`,
+    /// `image`, `audio`, `code`, `other`. Unknown values match nothing.
```

(critic r2 NEW-2 correction: spec §4.3 shows a 1-line form, while the actual clap doc-comment spans 3 lines. Keep the file's multi-line layout as is and insert `code` on line 160, between `audio` and `other`.)

**§2.6.2**: edit `integrations/claude-code/kebab/SKILL.md` line 57:

```diff
-`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"other"`; alias `"md"` → `"markdown"`)
+`media` (string array — IN-list of `"markdown"` | `"pdf"` | `"image"` | `"audio"` | `"code"` | `"other"`; alias `"md"` → `"markdown"`)
```
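
To make the documented semantics concrete (comma-separated input, `md` as an alias for `markdown`, unknown values matching nothing), here is a small sketch. It is illustrative only: `parse_media_filter` and `ACCEPTED_MEDIA` are hypothetical names, and the real filter in kebab may normalize and match values differently (the plan notes that `code` is handled through a path separate from the `MEDIA_KINDS` const).

```rust
/// Values listed in the updated help text.
/// Hypothetical constant: it mirrors the doc-comment, not the real `MEDIA_KINDS`.
const ACCEPTED_MEDIA: &[&str] = &["markdown", "pdf", "image", "audio", "code", "other"];

/// Split a comma-separated `--media` value, apply the `md` alias, and drop
/// values that are not in the accepted list (so they can match nothing).
fn parse_media_filter(raw: &str) -> Vec<String> {
    let mut kinds: Vec<String> = Vec::new();
    for token in raw.split(',') {
        let token = token.trim();
        // Alias step: `md` is shorthand for `markdown`.
        let canonical = if token == "md" { "markdown" } else { token };
        // Keep only known kinds, without duplicates.
        if ACCEPTED_MEDIA.contains(&canonical) && !kinds.iter().any(|k| k == canonical) {
            kinds.push(canonical.to_string());
        }
    }
    kinds
}
```

With this sketch, `parse_media_filter("md,code,bogus")` keeps `markdown` and `code` and drops `bogus`, which is the behavior a reader would infer from the help text once `code` is listed.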

#### §2.7 Acceptance

```bash
# A) cli crate clean compile (doc-comment edit; no compile impact expected)
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build -p kebab-cli -j 4 2>&1 | tail -3

# B) grep SKILL.md for the "code" substring
grep -nF '"code"' integrations/claude-code/kebab/SKILL.md

# C) search --help of the fresh binary exposes `code`
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build --release -p kebab-cli -j 4 2>&1 | tail -3
/build/out/cargo-target/target/release/kebab search --help 2>&1 | grep -F 'code'
```

Expected: A = `Finished`, B = 1 hit on line 57, C = 1 or more lines containing `code`.

#### §2.8 Commit

```bash
git add crates/kebab-cli/src/main.rs integrations/claude-code/kebab/SKILL.md
git commit -m "$(cat <<'EOF'
docs(cli): list 'code' in --media help string + SKILL.md (Bug #7)

Why: kebab search --media code has been functionally supported since
v0.18.0 (handled first-class via a path outside MEDIA_KINDS;
schema.v1.media_breakdown.code exists). However, the value list in the
SearchArgs clap doc-comment and in SKILL.md line 57 is stale: `code`
is missing. A user who reads only --help could conclude that code is
unsupported.

Change: sync 2 surfaces, the multi-line clap doc-comment at main.rs
lines 158-160 and integrations/claude-code/kebab/SKILL.md line 57.
No Rust binary surface / wire schema change.

Out of scope (follow-up): crates/kebab-mcp/src/tools/search.rs:44,
crates/kebab-core/src/search.rs:32+52,
crates/kebab-app/src/ingest_progress.rs:69, and
crates/kebab-cli/tests/wire_schema_breakdowns.rs:35 hold the same
stale list. They are outside the grep boundary of the ACCEPTed spec
(round 1c), so they are not part of this round.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md
§4.3 / §4.3a.
EOF
)"
```

---

### Step 3 — Bug #7 CLI help assertion test

#### §2.9 Files affected

- `crates/kebab-cli/tests/cli_help_smoke.rs` (new; not present in the existing file list).

#### §2.10 Action

Create the new file. Follow the existing test convention (`cli_*` prefix, `Command::new(env!("CARGO_BIN_EXE_kebab"))` pattern; see `cli_readonly_quiet.rs` and `cli_schema.rs` for reference):

```rust
// crates/kebab-cli/tests/cli_help_smoke.rs
//
// Regression pin: the `--media` value list in `kebab search --help`
// must expose `code`. Bug #7 (v0.20.0 bugfix round 2 spec §4.4).

#[test]
fn search_help_lists_code_in_media_values() {
    // Run the freshly built binary that cargo provides to integration tests.
    let out = std::process::Command::new(env!("CARGO_BIN_EXE_kebab"))
        .args(["search", "--help"])
        .output()
        .expect("kebab search --help");
    let stdout = String::from_utf8_lossy(&out.stdout);
    // The doc-comment renders verbatim, so the backticks are part of the match.
    assert!(
        stdout.contains("`code`"),
        "search --help must list 'code' as accepted --media value; stdout = {stdout}"
    );
}
```

#### §2.11 Acceptance

```bash
# A) build and run the new test target
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo test -p kebab-cli --test cli_help_smoke -j 4 2>&1 | tail -10

# B) cli crate tests target clean compile (all tests)
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo build -p kebab-cli --tests -j 4 2>&1 | tail -3

# C) cli clippy clean (-D warnings), including the new test file
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo clippy -p kebab-cli --all-targets -j 4 -- -D warnings 2>&1 | tail -5
```

Expected: A = `test result: ok. 1 passed; 0 failed`, B = `Finished`, C = 0 warnings.

#### §2.12 Commit

```bash
git add crates/kebab-cli/tests/cli_help_smoke.rs
git commit -m "$(cat <<'EOF'
test(cli): assert 'code' in search --help output (Bug #7 regression pin)

Why: the doc-comment edit from Step 2 risks silently disappearing if
someone later reorders the value list or splits it into an alias
section. clap renders --help from the free-form doc-comment text, so a
grep-only smoke test is the only means of detection.

Change: new test file (follows the kebab-cli `cli_*` prefix
convention). Runs the fresh binary via CARGO_BIN_EXE_kebab and asserts
the `code` substring in stdout. Maps 1:1 to the acceptance row in
spec §4.4.

Refs: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md §4.4
/ §5 (acceptance row 4).
EOF
)"
```

---

### Step 4 — Final sanity (no commit)

#### §2.13 Scope

After the 3 commits are appended, verify the whole workspace and optionally run the dogfood retest. **No commit is produced** (0 code changes; verification only).

#### §2.14 Acceptance

```bash
# A) full workspace test run: existing 1316 + this round's +2 unit + 1 cli = 1319 expected
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo test --workspace --no-fail-fast -j 1 2>&1 | tee /tmp/v0.20-bugfix2-test.log | tail -15
echo "exit = ${PIPESTATUS[0]:-$?}"

# B) workspace clippy clean (-D warnings)
CARGO_TARGET_DIR=/build/out/cargo-target/target cargo clippy --workspace --all-targets -j 4 -- -D warnings 2>&1 | tail -8

# C) (optional) dogfood retest with metro-korea.pdf
#    The fresh binary build was already completed in the Step 2 acceptance.
#    After --force-reingest, observe pdf_ocr_pages change from 0 to 21+.
#    OCR latency costs about 10 min; the plan drafter marks this optional for the executor.
#    Can be skipped if the measured corpus is user-private (KEBAB_WORKSPACE or ~/Documents/test/).
```

Expected: A = `test result: ok. <N> passed; 0 failed` (N ≥ 1319), B = 0 warnings, C = at the user's discretion (evaluated in verifier round 0).
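
`cargo test --workspace` prints one `test result:` line per test binary, so the N in criterion A is most naturally read as the sum over all of those lines rather than a single line in the `tail` output. A minimal sketch of that tally over the saved log follows; `tally_test_results` is a hypothetical helper, not part of the repo, and it assumes the usual `test result: ok. 10 passed; 0 failed; ...` summary format.

```rust
/// Sum the `passed` and `failed` counts over every `test result:` line
/// found in a captured cargo test log.
fn tally_test_results(log: &str) -> (u64, u64) {
    let mut passed: u64 = 0;
    let mut failed: u64 = 0;
    for line in log.lines() {
        let line = line.trim();
        if !line.starts_with("test result:") {
            continue;
        }
        // Summary fields are separated by ';', e.g. "test result: ok. 10 passed".
        for field in line.split(';') {
            let words: Vec<&str> = field.split_whitespace().collect();
            // Look for a "<number> passed" or "<number> failed" pair.
            for pair in words.windows(2) {
                if let Ok(n) = pair[0].parse::<u64>() {
                    match pair[1] {
                        "passed" => passed += n,
                        "failed" => failed += n,
                        _ => {}
                    }
                }
            }
        }
    }
    (passed, failed)
}
```

Feeding the contents of `/tmp/v0.20-bugfix2-test.log` to this helper and checking `passed >= 1319 && failed == 0` would express criterion A as a single comparison.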

#### §2.15 Commit

None (sanity-only). Once the sanity checks are green, the executor proceeds to the PR push stage.

---

## §3 Verifier checklist (cumulative)

Maps 1:1 to the 7 acceptance criteria rows of spec §5, plus one optional row. Actionable commands for verifier round 0:

| # | Spec §5 criterion | Verifier command | Step coverage | Pass condition |
|---|-------------------|------------------|---------------|----------------|
| 1 | `identity_h_marker_dominance_caps_ratio_below_threshold` green | `cargo test -p kebab-parse-pdf identity_h_marker_dominance_caps_ratio_below_threshold -j 4 2>&1 \| tail -3` | Step 1 | `1 passed; 0 failed` |
| 2 | `identity_h_marker_minority_with_long_valid_text_keeps_high_ratio` green | `cargo test -p kebab-parse-pdf identity_h_marker_minority_with_long_valid_text_keeps_high_ratio -j 4 2>&1 \| tail -3` | Step 1 | `1 passed; 0 failed` |
| 3 | The 8 existing text_quality tests green (0 regressions) | `cargo test -p kebab-parse-pdf text_quality -j 4 2>&1 \| tail -5` | Step 1 | `10 passed; 0 failed` (8 existing + 2 new) |
| 4 | `search_help_lists_code_in_media_values` green | `cargo test -p kebab-cli --test cli_help_smoke -j 4 2>&1 \| tail -3` | Step 3 | `1 passed; 0 failed` |
| 5 | `"code"` substring present in SKILL.md | `grep -nF '"code"' integrations/claude-code/kebab/SKILL.md` | Step 2 | 1 hit on line 57 |
| 6 | Full workspace test run green | `cargo test --workspace --no-fail-fast -j 1 2>&1 \| tail -10` | Step 4 | `0 failed`, N ≥ 1319 |
| 7 | workspace clippy clean (-D warnings) | `cargo clippy --workspace --all-targets -j 4 -- -D warnings 2>&1 \| tail -5` | Step 4 | 0 warnings |
| 8 (optional) | dogfood retest: `pdf_ocr_pages` of metro-korea.pdf goes 0 → 21+ | manual: run `kebab ingest --force-reingest <ws>`, then inspect `items[].pdf_ocr_pages` in ingest_report.v1 | Step 4 | `pdf_ocr_pages > 0` for the metro-korea.pdf row |

The executor passes the PR push gate once rows 1-7 are all green. Row 8 is optional for verifier round 0 (depends on the availability of the user corpus and the 10 min cost).

---

## §4 Risks resolution (plan-level view of spec §6)

| ID | Spec §6 status | Plan-level action |
|----|----------------|--------------------|
| R-1 | resolved per critic r1 (lopdf 0.32.0 = 1 marker entry) | The source comment from §2.2.1 of this plan is the trigger to re-verify on a lopdf upgrade. |
| R-2 | resolved (covered by `trim().is_empty()`) | Stage 2 of the 4 stages in §2.2.1 (Step 1 implementation) = trim-empty zero. |
| R-3 | resolved (no wire schema change) | parser_version `"pdf-text-v1"` / chunker_version `"pdf-page-v1.1"` preserved. No version cascade impact (CLAUDE.md §Versioning cascade). |
| R-4 | resolved per critic r1 (grep boundary = `integrations/` + `crates/kebab-cli/src` + `docs/wire-schema/v1`) | Step 2 covers both surfaces within the spec scope. **Additional finding (out of scope)** → §5 OQ-4. |
| R-5 | resolved (no impact, via the alias normalization at `bulk.rs:161`) | No behavior change in this plan. |

Additional risk identified by the plan drafter:

- **R-6 (NEW)**: the optional dogfood retest in Step 4 depends on `KEBAB_WORKSPACE` or a user-private corpus. It cannot be verified in a CI environment; verifier round 0 is advised to mark row 8 as explicitly skipped when evidence is absent.

---

## §5 Open questions for executor

The ACCEPTed spec is unambiguous, so OQ-1/2/3 are all resolved. Additional items identified by the plan drafter:

- **OQ-4 (NEW)**: per the R-4 grep result in spec §2.2, surfaces outside the frozen boundary hold the same stale list:
  - `crates/kebab-mcp/src/tools/search.rs:44`: the `--media` doc of the MCP tool.
  - `crates/kebab-core/src/search.rs:32`: `MEDIA_KINDS` const = `&["markdown", "pdf", "image", "audio", "other"]`. Caution: this const may be functional. `code` has been handled first-class through a separate path since v0.18.0 (`schema.v1.media_breakdown.code` confirmed to exist per spec §1.2). Modifying the const itself carries a behavior change risk, so it is split off into a separate spec.
  - `crates/kebab-core/src/search.rs:52`: `MediaFilter::media` doc-comment.
  - `crates/kebab-app/src/ingest_progress.rs:69`: progress label doc-comment.
  - `crates/kebab-cli/tests/wire_schema_breakdowns.rs:35`: test fixture array (functional; changing it affects what the test means).

  **Executor action**: not included in this round. It is recommended to add one line, "follow-up: open issue for stale --media value list in 5 additional surfaces", to the PR description or the Step 2 commit body.

- **OQ-5 (NEW)**: the UX consequence in spec §6. The `--force-reingest` recommendation for pre-bugfix2 v0.20.0 users must go into the release notes, which is work for a separate phase (at PR review/merge time). The HOTFIXES entry from Step 1 §2.2.3 of this plan is part of the user-facing surface, and the release notes can quote its user action item.

---

## §6 References

- **Spec ACCEPT (parent contract)**: `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix2-spec.md` (308 lines, round 1c).
- **Critic round 1**: `.omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r1-result.md` (H-1 + M-1/M-2/M-3 + L-1/L-2 + NIT-1/NIT-2 + invariant audit; all 9 findings applied to the spec).
- **Critic round 2**: `.omc/reviews/2026-05-27-v0.20-bugfix2-spec-critic-r2-result.md` (NEW-1 = §4.2 stale arithmetic, NEW-2 = §4.3 scope description drift; both corrections are reflected in §2.2.2 and §2.6.1 of this plan).
- **Plan drafter brief**: `.omc/reviews/2026-05-27-v0.20-bugfix2-plan-drafter-brief.md`.
- **Parent design**: `docs/superpowers/specs/2026-05-27-pdf-scanned-ocr-spec.md` §1.3 (text-detect threshold metric), §9 (version cascade).
- **Round 1 history**: branch `feat/pdf-scanned-ocr` HEAD = `e674ff4`, 4 commits stacked (Bug #2 source-fs, Bug #3 chunk_id collision, Bug #3 test, Bug #4 pikepdf F4 fixture).
- **Code locations (measured line numbers)**:
  - `crates/kebab-parse-pdf/src/text_quality.rs:1-103` (entire file).
  - `crates/kebab-cli/src/main.rs:158-160` (SearchArgs `media` clap doc-comment, 3-line multi-line attribute).
  - `integrations/claude-code/kebab/SKILL.md:57` (description of the search input filter).
  - `crates/kebab-cli/tests/cli_help_smoke.rs` (new, Step 3).
- **External source**: `lopdf-0.32.0/src/document.rs:523` (`Document::decode_text`, the sole emitter of `?Identity-H Unimplemented?`).

---

## §7 Constraints (spec §9 + brief §9)

1. **No branch changes**: the plan itself is documentation only. This file is the plan deliverable.
2. **Spec ACCEPT frozen**: the round 1c body is not to be patched. The wording corrections in §2.2 / §2.6 of this plan (`Replace existing` → `+2 / -0 additive`) are recorded as local notes in the plan; the spec text is unchanged.
3. **0 regressions**: workspace test N ≥ 1319.
4. **No wire schema / version cascade change**: `parser_version="pdf-text-v1"` and `chunker_version="pdf-page-v1.1"` are preserved.
5. **Subagent skip**: the executor runs in-session on a single thread (worker protocol per task assignment).
6. **Lightweight scope**: the line target for this plan is 200-400 (less than half of the round 1 plan's 849 lines).

**Status**: DRAFT round 0, awaiting verifier review.

1043
docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix3-plan.md
Normal file
File diff suppressed because it is too large