kebab/docs/superpowers/plans/2026-05-27-v0.20-sub1-bugfix-plan.md
altair823 46e99470eb docs(superpowers): v0.20 sub-item 1 bugfix1/2/3 specs + plans + DOGFOOD.md
Output of the 3-round dogfood-driven fix cycle:

- bugfix1 (Bug #2/#3/#4): spec 964 lines + plan 848 lines
- bugfix2 (Bug #6/#7, #8 falsified): spec 308 lines + plan 388 lines
- bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 lines + plan 1043 lines
- docs/DOGFOOD.md: the full end-to-end dogfood checklist (§0 environment through §13 reference corpus)

Each round's spec/plan was frozen after the critic + verifier round 2 closure ACCEPT, based on dogfood-driven evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 01:21:34 +00:00


title: v0.20.0 sub-item 1 bugfix — implementation plan
created: 2026-05-27
status: ACCEPT (round 2 closure, Phase B complete)
target_version: 0.20.0 (PR
spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md
contract_sections:
  - §9 (chunker_version cascade)
parent_plan: docs/superpowers/plans/2026-05-27-pdf-scanned-ocr-plan.md
review_history:
  - 2026-05-27 plan round 0 (opus, drafter): 5 step groups A-E, 18 sub-actions
  - 2026-05-27 plan round 1 critic (opus, thorough): NEEDS_DISCUSSION, HIGH 1 + MEDIUM 2 + LOW 3 + NIT 1 (7 findings)
  - 2026-05-27 plan round 1 verifier (opus, thorough): NEEDS_DISCUSSION, HIGH 4 + MEDIUM 3 + LOW 4 + NIT 3 (14 findings)
  - 2026-05-27 plan round 1c rewrite (opus, drafter): all 21 findings applied (critic 7 + verifier 14). detail = §8 round 1c rewrite changelog
  - 2026-05-27 plan round 2 closure critic (opus): ACCEPT, 21/21 applied + 4 NIT cosmetic

v0.20.0 sub-item 1 bugfix plan

Step decomposition of the ACCEPTed spec (docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md, 965 lines, round 2 closure). It covers the force-update path for 3 bugs (#2 walker code limit / #3 chunk_id collision, Critical / #4 F4 fixture Pages tree): fix commits are stacked on top of feat/pdf-scanned-ocr, the base branch of PR #189. 5 steps (Group A-E), 18 sub-actions, 4 commits + 1 verify-only step (= 5 steps total, 4 commit boundaries). The 16-row acceptance of spec §6 maps 1:1 onto the §4 verifier checklist of this plan.

§0 Pre-flight + branch state

  • Branch: feat/pdf-scanned-ocr (PR #189 base, HEAD = b4d9e60 "chore(release): bump version 0.19.0 → 0.20.0"). Per the user memory feedback_pr_workflow, this follows the force-update path: add fix commits on the same branch, then force-push PR #189.
  • Working dir: /home/altair823/kebab.
  • Enforced env (~/.claude/CLAUDE.md, "Disk Layout: protecting the root disk is the top priority"):
    • export CARGO_TARGET_DIR=/build/out/cargo-target/target (isolated on the dedicated 4 TB XFS disk; prevents a target/ directory from being created in the repo root).
    • export RELEASE_BIN="${CARGO_TARGET_DIR:-target}/release/kebab" (release binary alias, used by every acceptance command in the Step 5 dogfood).
    • export TMPDIR=/build/cache/tmp (keeps large temporary files on the /build disk).
  • Cargo build serialization (memory feedback_serial_build_only):
    • per-crate: -j 4 default (e.g. cargo test -p kebab-chunk -j 4).
    • workspace: -j 1 enforced (cargo test --workspace -j 1, cargo clippy --workspace -j 1; linking 18 integration-test binaries concurrently causes OOM).
  • target/ clean policy (memory feedback_cargo_clean_policy): because /build is a separate 4 TB XFS disk, routine cleans are prohibited. Clean only when df -h /build reports Avail < 500G OR du -sh $CARGO_TARGET_DIR reports > 200G. Run the conditional check once, right before the first cargo invocation in Step 5 E1; if neither threshold is hit, skip the clean and record one line in the commit body: "skipped cargo clean — /build avail X TB".
  • dogfood KB layout assumption (Step 5 E3 prerequisite, critic round 1 H-1 closure): canonical config path = /build/cache/tmp/v0.20-dogfood/config.toml (in-place, inside the KB dir). The external backup file /build/cache/tmp/v0.20-dogfood-config.toml does not exist, so every acceptance command in this plan assumes the in-place config. The KB clean in Step 5 E3 must not use a destructive rm -rf; use a selective clean that preserves the config (see the E3 detail). The canonical dogfood config path is defined only here in §0, and the Step 5 E3 commands reference this path.
  • Impact on HOTFIXES.md / README / HANDOFF / ARCHITECTURE: Step 2 B4 adds a HOTFIXES.md entry (Bug #3 second-iteration patch). No other user-visible surface changes, so README, HANDOFF, and ARCHITECTURE need 0 updates (0 new CLI flags / wire schema fields / TUI keys / config entries; the chunker_version bump is an internal cascade and only needs release notes).
  • wire schema changes: 0. ingest_progress.v1 + ingest_report.v1 gain 0 fields. 0 V00X migrations. chunks table DDL unchanged.
  • frozen design contract changes: 0. The design §9 cascade rule itself is unchanged (only chunker_version is bumped, as a direct application of the rule).
  • workspace version bumps: 0. v0.20.0 is already cut (commit b4d9e60). This plan is a cumulative bugfix within the same v0.20.0 (PR #189 force-update). Step 5 E5 only force-pushes the PR; the release tag is not re-cut.
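The target/ clean policy above reduces to a two-threshold predicate. A minimal sketch, assuming both sizes have already been parsed into GiB (the function name `should_clean` and the GiB units are illustrative and not part of the plan):

```rust
/// Hypothetical helper mirroring the clean policy: clean only when /build is
/// running low on space OR the cargo target dir has grown too large.
/// Units (GiB) and the function name are assumptions for illustration.
fn should_clean(build_avail_gib: u64, target_dir_gib: u64) -> bool {
    const MIN_AVAIL_GIB: u64 = 500; // "Avail < 500G"
    const MAX_TARGET_GIB: u64 = 200; // "du -sh $CARGO_TARGET_DIR > 200G"
    build_avail_gib < MIN_AVAIL_GIB || target_dir_gib > MAX_TARGET_GIB
}
```

When the predicate is false, the plan's skip path applies and the one-line record goes into the commit body.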

§1 Plan overview + spec linkage

Decomposes the fix designs of spec §3 (Bug #2), §4 (Bug #3), and §5 (Bug #4) into atomic steps. Key sequencing:

  1. Bug #2 walker code limit fix (Step 1): is_code_file helper + walker conditional + unit test. Apply the diffs from spec §3.4 + §3.5 as-is. 1 commit.
  2. Bug #3 chunk_id fix + chunker_version bump (Step 2): extend the chunk_page return tuple to a 4-tuple + change the caller's per_chunk_hash suffix to segment_start + bump VERSION_LABEL "pdf-page-v1" → "pdf-page-v1.1" + update the module doc + HOTFIXES.md entry + unit regression test. Apply the diffs from spec §4.4 + §4.4.1 + §4.5 as-is. 1 commit.
  3. Bug #3 integration test (Step 3): a multi-scanned PDF chunk_id collision-free integration test under crates/kebab-app/tests/. The executor's first sub-action is the MockOcrEngine pre-condition decision from spec §4.5.1 (Option A share or Option B inline). 1 commit.
  4. Bug #4 F4 fixture re-generation (Step 4): pikepdf-based rewrite of tests/fixtures/_synth/mojibake.py + regenerate the F4 fixture binary + 3 new invariant tests in parse-pdf. Apply the diffs from spec §5.4 + §5.5 + §5.6 as-is. 1 commit.
  5. Workspace verify + commit + PR force-push (Step 5): cargo workspace test -j 1 + clippy -D warnings + dogfood re-run (/build/cache/tmp/v0.20-dogfood isolated KB, qwen2.5vl:3b on the Ollama endpoint 192.168.0.47:11434) + PR #189 force-push. The 16-row consolidated acceptance of spec §6 is this step's verifier checklist.

ordering invariant:

  • Step 1 || Step 2 || Step 4 are mutually independent: the 3 bug fixes are confined to file paths in different crates (kebab-source-fs / kebab-chunk / tests/fixtures + kebab-parse-pdf), so they could proceed in parallel. Consistency takes priority, so run them sequentially: Step 1 → Step 2 → Step 4.
  • Step 2 < Step 3: the integration test depends on the fixed chunk_id computation path in kebab-chunk. Step 2 being GREEN is a prerequisite.
  • Step 4 < Step 5 dogfood: the binary produced by the F4 fixture regeneration is 1 of the 9 dogfood PDFs (mojibake), a prerequisite for verifying the block_count: 1 invariant in the Step 5 E3 dogfood.
  • Steps 1-4 all < Step 5 workspace test: the full workspace test is only meaningful on the final state of production code + tests.
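The ordering invariant can be read as a small dependency graph. The sketch below is illustrative only: the edge list is transcribed from the bullets above, and `respects_ordering` is a hypothetical helper name, not something the plan or repo defines.

```rust
/// Prerequisite edges (before, after) taken from the ordering invariant:
/// Step 2 < Step 3, Step 4 < Step 5, and Steps 1-4 all < Step 5.
const EDGES: &[(u8, u8)] = &[(2, 3), (1, 5), (2, 5), (3, 5), (4, 5)];

/// Returns true when `order` lists each of steps 1..=5 exactly once and
/// every prerequisite step appears before its dependent step.
fn respects_ordering(order: &[u8]) -> bool {
    let pos = |s: u8| order.iter().position(|&x| x == s);
    order.len() == 5
        && (1..=5).all(|s| order.iter().filter(|&&x| x == s).count() == 1)
        && EDGES.iter().all(|&(a, b)| match (pos(a), pos(b)) {
            (Some(i), Some(j)) => i < j,
            _ => false,
        })
}
```

Both the plan's chosen sequence (1, 2, 4, then 3 and 5) and the plain 1 to 5 order satisfy this check; any order that runs Step 3 before Step 2 does not.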

Commits are made one per logical group (atomic), following the 5-commit table in the §7 sequencing summary. Per the user memory feedback_pr_workflow (gitea-pr + review loop), enter the review loop of the gitea-pr-review skill after the force-update.


§2 Step group structure (Group A-E)

| Step | Group | Category | Sub-actions |
|---|---|---|---|
| 1 | A | Bug #2 walker code limit fix | A1 is_code_file helper + A2 walker conditional + A3 unit test |
| 2 | B | Bug #3 chunk_id collision fix + chunker_version bump | B1 chunk_page 4-tuple + B2 caller per_chunk_hash + B3 VERSION_LABEL bump + B4 module doc + HOTFIXES.md + B5 unit regression test |
| 3 | C | Bug #3 multi-scanned PDF integration test | C1 MockOcrEngine share decision + C2 integration test (conditional) |
| 4 | D | Bug #4 F4 fixture re-generation | D1 mojibake.py pikepdf rewrite + D2 fixture regenerate + commit + D3 parse-pdf 3 invariant tests |
| 5 | E | Workspace verify + commit + PR force-push | E1 cargo workspace test -j 1 + E2 clippy -D warnings + E3 dogfood re-run + E4 commit + E5 PR #189 force-push |

§3 Per-step detail

Step 1 (Group A): Bug #2 walker code limit fix

Option A of spec §3 (code path only): gate the is_oversized call behind an is_code_file(path) conditional. PDF/image/markdown sizes are validated by the parser stage itself (lopdf load_mem handles 256 KB+ fine; image OCR self-caps via max_pixels).

Sub-action A1 — add the is_code_file helper

  • Files affected: crates/kebab-source-fs/src/code_meta.rs (right after the is_oversized function at line 129, or right after the code_lang_for_path definition).
  • Action (apply the spec §3.4 diff as-is):
    /// Returns true when `path`'s filename/extension is recognised as a code
    /// file (per `code_lang_for_path`). Used by the walker to apply
    /// `[ingest.code].max_file_bytes` / `max_file_lines` only to code files,
    /// not to PDF/image/markdown (which have their own size controls in
    /// their respective parsers).
    pub(crate) fn is_code_file(path: &Path) -> bool {
        code_lang_for_path(path).is_some()
    }
    
  • Acceptance:
    • grep -c "pub(crate) fn is_code_file" crates/kebab-source-fs/src/code_meta.rs = 1.
    • cargo build -p kebab-source-fs -j 4 green.

Sub-action A2 — walker conditional size check

  • Files affected: crates/kebab-source-fs/src/connector.rs:168-190 (currently verified line range).
  • Action (apply the spec §3.4 diff as-is: short-circuit is_code_file ahead of the is_oversized call):
    -            // Size-cap check (byte or line limit).
    -            if crate::code_meta::is_oversized(
    -                &abs_path,
    -                self.max_file_bytes,
    -                self.max_file_lines,
    -            )
    -            .unwrap_or(false)
    +            // v0.20.0 sub-item 1 bugfix (#2): size-cap applies ONLY to
    +            // code files. PDF/image/markdown bypass — their parsers
    +            // have their own size controls. spec §3.3.
    +            if crate::code_meta::is_code_file(&abs_path)
    +                && crate::code_meta::is_oversized(
    +                    &abs_path,
    +                    self.max_file_bytes,
    +                    self.max_file_lines,
    +                )
    +                .unwrap_or(false)
                {
                    fs_skips.skipped_size_exceeded =
                        fs_skips.skipped_size_exceeded.saturating_add(1);
                    ...
                    tracing::debug!(
                        path = %rel_path.display(),
                        max_bytes = self.max_file_bytes,
                        max_lines = self.max_file_lines,
    -                    "skip: file exceeds size cap"
    +                    "skip: code file exceeds size cap"
                    );
                    continue;
                }
    
  • Acceptance:
    • grep -nE "is_code_file\(&abs_path\)\s*$" crates/kebab-source-fs/src/connector.rs ≥ 1.
    • grep -c "skip: code file exceeds size cap" crates/kebab-source-fs/src/connector.rs ≥ 1.
    • cargo build -p kebab-source-fs -j 4 green.
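The effect of the A1 + A2 change can be shown with a self-contained sketch. Everything here is a simplification under stated assumptions: `code_lang_for_path` is a stand-in with a made-up extension list (the real function lives in code_meta.rs and may recognise different files), and `should_skip` checks only a byte limit on a size passed in by the caller rather than reading the file.

```rust
use std::path::Path;

/// Stand-in for the crate's `code_lang_for_path`; the extension list is an
/// assumption for illustration only.
fn code_lang_for_path(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "rs" => Some("rust"),
        "py" => Some("python"),
        "go" => Some("go"),
        _ => None,
    }
}

/// Same shape as the helper added in A1.
fn is_code_file(path: &Path) -> bool {
    code_lang_for_path(path).is_some()
}

/// Simplified walker gate (hypothetical name): the size cap is consulted
/// only when the path is a code file, so PDF/markdown never hit it.
fn should_skip(path: &Path, size_bytes: u64, max_file_bytes: u64) -> bool {
    is_code_file(path) && size_bytes > max_file_bytes
}
```

With a 262_144-byte cap, a 300 KB big.rs is skipped while a 300 KB paper.pdf or notes.md passes through, which is the behaviour the A3 unit test pins.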

Sub-action A3 — add the Bug #2 unit test

  • Files affected: crates/kebab-source-fs/src/connector.rs, in the existing #[cfg(test)] mod tests (spec §3.5 states "add to the existing test module"; not a new file).
  • Action (the size_cap_skips_only_code_files test body from spec §3.5, as-is):
    • Synthesize a 300 KB PDF / 300 KB markdown / 300 KB big.rs (3 files) in a tempdir.
    • Call scan_with_skips(&SourceScope::default()) on an FsSourceConnector (max_file_bytes = 262_144, max_file_lines = 5_000).
    • assertions:
      • paths.contains("paper.pdf") (PDF walker pass).
      • paths.contains("notes.md") (Markdown walker pass).
      • !paths.contains("big.rs") (code file walker skip).
      • skips.skip_examples.size_exceeded: 1 entry for big.rs, 0 entries for paper.pdf.
    • cfg helper: reuse the cfg_with_size_cap(root, max_bytes, max_lines) pattern from the existing test module (add the helper if needed).
  • Acceptance:
    • cargo test -p kebab-source-fs size_cap_skips_only_code_files -j 4 green.
    • 0 regressions in the existing ingest_report_counts_oversized_files_by_bytes (fixture huge.rs) + ingest_report_size_cap_by_line_count (fixture longfile.rs): the fixture names end in .rs, so they pass the new conditional (invariant preserved).
    • cargo test -p kebab-source-fs -j 4: full suite green.

Commit (Step 1 전체)

fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2)

- crates/kebab-source-fs/src/code_meta.rs: add pub(crate) fn is_code_file
- crates/kebab-source-fs/src/connector.rs: walker conditional `is_code_file && is_oversized`
- crates/kebab-source-fs/src/connector.rs mod tests: size_cap_skips_only_code_files unit test
- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §3

Step 2 (Group B): Bug #3 chunk_id collision fix + chunker_version bump

Option A of spec §4.3 (use the segment boundary start as the per_chunk_hash suffix). Extend the chunk_page return tuple from a 3-tuple (actual_start, chunk_end, slice) to a 4-tuple (segment_start, actual_start, chunk_end, slice), and change the caller's per_chunk_hash suffix to segment_start. Bump VERSION_LABEL "pdf-page-v1" → "pdf-page-v1.1" (spec §4.4.1 round 1c M-1 decision: explicit cascade audit trail).

Sub-action B1 — chunk_page 4-tuple expansion

  • Files affected: crates/kebab-chunk/src/pdf_page_v1.rs:200-204 (doc comment) + :205-289 (signature at line 205 → closing } at line 289). Corrected per the actual probe results of critic round 1 + verifier round 1 (L-1).
  • Action (apply the spec §4.4 diff as-is):
    • Update the doc comment: (char_start, char_end, text_slice) → (segment_start, actual_start, chunk_end, text_slice):
      /// Split a single page's text into ordered chunks, each represented as
      /// `(segment_start, actual_start, chunk_end, text_slice)`.
      ///
      /// - `segment_start` = pre-overlap segment boundary. Strictly increasing
      ///   across the returned vec. Use this for chunk_id uniqueness suffixes.
      /// - `actual_start` = post-overlap start char index. May collapse to a
      ///   previous chunk's `actual_start` under aggressive overlap policy.
      ///   Use this for `SourceSpan::Page::char_start`.
      /// - `chunk_end` = chunk's end char index (exclusive).
      fn chunk_page(text: &str, target_bytes: usize, overlap_bytes: usize)
          -> Vec<(usize, usize, usize, String)>
      
    • early return: vec![(0, n, text.to_string())] → vec![(0, 0, n, text.to_string())].
    • push in the loop body: chunks.push((actual_start, chunk_end, slice)) → chunks.push((start, actual_start, chunk_end, slice)). (start = bounds[seg_idx] already exists as a local var at line 245.)
    • The overlap walk's let prev_min = prev.0 reads the first field of the old tuple; in the post-fix tuple shape that value lives at prev.1 (actual_start), so change it to preserve the spec §4.4 invariant:
      -        let actual_start = if let Some(prev) = chunks.last() {
      -            let prev_min = prev.0;
      +        let actual_start = if let Some(prev) = chunks.last() {
      +            // prev tuple shape = (segment_start, actual_start, chunk_end, slice).
      +            // overlap walk floor = previous chunk's actual_start (prev.1).
      +            let prev_min = prev.1;
                  ...
      
  • Acceptance:
    • grep -nE "fn chunk_page.*-> Vec<\(usize, usize, usize, String\)>" crates/kebab-chunk/src/pdf_page_v1.rs = 1.
    • grep -c "let prev_min = prev.1" crates/kebab-chunk/src/pdf_page_v1.rs ≥ 1.
    • cargo build -p kebab-chunk -j 4 green (the build stays red until the caller change in sub-action B2 is applied alongside).
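To make the 4-tuple shape concrete, here is a deliberately simplified model of the overlap walk. It is a sketch under assumptions, not the real `chunk_page`: segment boundaries are passed in precomputed, the text is assumed ASCII so byte slicing is safe, and the name `chunk_page_sketch` is hypothetical.

```rust
/// Simplified model of the post-fix return shape
/// `(segment_start, actual_start, chunk_end, slice)`.
/// `bounds` must be strictly increasing and end at `text.len()`.
/// Assumes ASCII text so that byte offsets are valid slice boundaries.
fn chunk_page_sketch(
    text: &str,
    bounds: &[usize],
    overlap: usize,
) -> Vec<(usize, usize, usize, String)> {
    let mut chunks: Vec<(usize, usize, usize, String)> = Vec::new();
    for w in bounds.windows(2) {
        let (start, chunk_end) = (w[0], w[1]);
        // Walk back by `overlap`, but never before the previous chunk's
        // actual_start (prev.1 in the 4-tuple shape).
        let actual_start = match chunks.last() {
            Some(prev) => start.saturating_sub(overlap).max(prev.1),
            None => start,
        };
        let slice = text[actual_start..chunk_end].to_string();
        chunks.push((start, actual_start, chunk_end, slice));
    }
    chunks
}
```

With bounds [0, 30, 100] and an overlap of 80, the second chunk's actual_start walks back to 0 and equals the first chunk's, while segment_start stays at 30. That is the property the fix relies on.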

Sub-action B2 — caller per_chunk_hash suffix → segment_start

  • Files affected: crates/kebab-chunk/src/pdf_page_v1.rs:149-186 (currently verified: in the chunk method, the for (...) in chunk_page(...) loop starts at line 149 and ends at line 186; verifier round 1 L-2 correction).
  • Action (apply the spec §4.4 diff as-is):
    -            for (char_start, char_end, slice) in
    -                chunk_page(&p.text, target_bytes, overlap_bytes)
    +            for (segment_start, char_start, char_end, slice) in
    +                chunk_page(&p.text, target_bytes, overlap_bytes)
                 {
                     ...
                     let span = SourceSpan::Page {
                         page: page_num,
                         char_start: Some(char_start_u32),
                         char_end: Some(char_end_u32),
                     };
                     let block_ids: Vec<BlockId> = vec![p.common.block_id.clone()];
    -                // Per-chunk policy_hash variant prevents chunk_id
    -                // collision when a page produces multiple chunks. See
    -                // module docs for rationale.
    -                let per_chunk_hash = format!("{base_policy_hash}#c{char_start}");
    +                // v0.20.0 sub-item 1 bugfix (#3): per-chunk policy_hash
    +                // variant uses `segment_start` (pre-overlap boundary,
    +                // strictly increasing) instead of `char_start` (post-
    +                // overlap, may collapse to prev_min). See module docs +
    +                // spec §4.1 root cause + HOTFIXES.md 2026-05-27.
    +                let per_chunk_hash = format!("{base_policy_hash}#c{segment_start}");
                     let chunk_id =
                         id_for_chunk(&doc.doc_id, &chunker_version, &block_ids, &per_chunk_hash);
                     ...
                 }
    
    • SourceSpan::Page.char_start still keeps the post-overlap char_start (= actual_start), preserving the citation locality semantics.
  • Acceptance (verifier round 1 M-2: B2 + B4 land in the same logical commit, so the grep is evaluated at Step 2 commit time, i.e. post-B4):
    • grep -c "#c{segment_start}" crates/kebab-chunk/src/pdf_page_v1.rs ≥ 1 (with B2 alone = 1 call site; after the B4 module doc update = 2; the B4 acceptance verifies ≥ 2).
    • grep -c "#c{char_start}" crates/kebab-chunk/src/pdf_page_v1.rs = 0 (both the call site and the module doc are switched to segment_start; this is the same-commit consolidated invariant of B2 + B4).
    • When verifying sub-action by sub-action, the B2-only grep for #c{char_start} is ≥ 1 because the literal remains in the module doc at line 56; it settles at 0 once the Step 2 commit boundary is reached.
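The difference between the two suffix choices can be checked in isolation. The sketch below only models the suffix string; it does not call the real `id_for_chunk`, and the helper names `per_chunk_hashes` and `all_unique` are hypothetical.

```rust
use std::collections::HashSet;

/// Builds per-chunk policy hash variants from (segment_start, actual_start)
/// pairs. `use_segment_start = false` models the pre-fix `#c{char_start}`
/// suffix; `true` models the post-fix `#c{segment_start}` suffix.
fn per_chunk_hashes(
    base_policy_hash: &str,
    chunks: &[(usize, usize)],
    use_segment_start: bool,
) -> Vec<String> {
    chunks
        .iter()
        .map(|&(segment_start, actual_start)| {
            let suffix = if use_segment_start { segment_start } else { actual_start };
            format!("{base_policy_hash}#c{suffix}")
        })
        .collect()
}

/// True when no two hash variants are identical.
fn all_unique(hashes: &[String]) -> bool {
    hashes.iter().collect::<HashSet<_>>().len() == hashes.len()
}
```

For a page whose second chunk has its post-overlap start collapsed to 0, the pairs (0, 0) and (30, 0) collide under the old suffix and stay distinct under the new one.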

Sub-action B3 — VERSION_LABEL bump "pdf-page-v1" → "pdf-page-v1.1" + update 2 hardcoded literal sites

  • Files affected (the actual probe grep -rn '"pdf-page-v1"' crates/ --include='*.rs' from verifier round 1 H-1 enumerates 2 sites):
    • crates/kebab-chunk/src/pdf_page_v1.rs:67 (currently verified: const VERSION_LABEL: &str = "pdf-page-v1";).
    • crates/kebab-app/tests/pdf_pipeline.rs:168 (currently verified: assert_eq!(pdf_item.chunker_version.as_ref().map(|c| c.0.as_str()), Some("pdf-page-v1")) is a hard assertion and fails after the v1.1 bump).
    • crates/kebab-app/tests/pdf_pipeline.rs:368 (currently verified: error message string literal "pdf-page-v1 emits 0 chunks for the empty page; total = 2"; not a hard assertion, but updated so it does not go stale).
  • Action (spec §4.4.1 decision):
    • (a) primary const bump (crates/kebab-chunk/src/pdf_page_v1.rs:67):
      -const VERSION_LABEL: &str = "pdf-page-v1";
      +const VERSION_LABEL: &str = "pdf-page-v1.1";
      
      The assertion in the existing test chunker_version_is_pdf_page_v1 (pdf_page_v1.rs:374) references the VERSION_LABEL const, so it updates automatically; no test code change is needed.
    • (b) update the test assertion literal (crates/kebab-app/tests/pdf_pipeline.rs:168, required):
      -        Some("pdf-page-v1")
      +        Some("pdf-page-v1.1")
      
    • (c) update the test error message literal (crates/kebab-app/tests/pdf_pipeline.rs:368, recommended):
      -        "pdf-page-v1 emits 0 chunks for the empty page; total = 2"
      +        "pdf-page-v1.1 emits 0 chunks for the empty page; total = 2"
      
  • Acceptance:
    • grep -nE 'const VERSION_LABEL: &str = "pdf-page-v[0-9.]+";' crates/kebab-chunk/src/pdf_page_v1.rs reports "pdf-page-v1.1".
    • cargo test -p kebab-chunk chunker_version_is_pdf_page_v1 -j 4 green (passes automatically because it references VERSION_LABEL).
    • grep -rn '"pdf-page-v1"' crates/ --include='*.rs' | grep -v 'pdf-page-v1\.1' = 0 results (the grep -v filter guards against false positives by excluding lines that mention pdf-page-v1.1). 0 lines after the filter means no stale literal remains.
    • cargo test -p kebab-app pdf_pipeline -j 4 green (after the line 168 assertion update).
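The stale-literal acceptance check is a two-stage grep pipeline. As an illustration only, the same filter can be expressed over an in-memory string; `stale_label_lines` is a hypothetical helper and is not part of the plan's tooling.

```rust
/// Returns the 1-based line numbers that contain the exact quoted literal
/// "pdf-page-v1" and do not mention pdf-page-v1.1 anywhere on the line,
/// mirroring `grep '"pdf-page-v1"' | grep -v 'pdf-page-v1\.1'`.
fn stale_label_lines(source: &str) -> Vec<usize> {
    source
        .lines()
        .enumerate()
        .filter(|(_, line)| {
            line.contains("\"pdf-page-v1\"") && !line.contains("pdf-page-v1.1")
        })
        .map(|(idx, _)| idx + 1)
        .collect()
}
```

An empty result corresponds to the "0 results" acceptance condition; a non-empty result lists the lines that still need the v1.1 literal.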

Sub-action B4 — update module doc + HOTFIXES.md entry

  • Files affected:
    • crates/kebab-chunk/src/pdf_page_v1.rs:47-60 (currently verified: module doc paragraph ## chunk_id collision deviation).
    • tasks/HOTFIXES.md (insert a new dated entry at the existing entry position: the file's latest entry is 2026-05-26, so insert the 2026-05-27 — v0.20.0 sub-item 1 entry above it, following the file's chronological pattern).
  • Action:
    • (a) module doc: the updated text from spec §4.4, as-is:
      -//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
      -//! variant `format!("{base_policy_hash}#c{char_start}")` into the
      -//! recipe's `policy_hash` slot (so distinct chunks distinguish via
      -//! different policy_hash inputs), while storing the unmodified
      -//! `base_policy_hash` in `Chunk.policy_hash` so the field still answers
      -//! "what policy was active". Logged in `tasks/HOTFIXES.md`.
      +//! Workaround that doesn't change the §4.2 recipe: feed a per-chunk
      +//! variant `format!("{base_policy_hash}#c{segment_start}")` into the
      +//! recipe's `policy_hash` slot. `segment_start` is the pre-overlap
      +//! segment boundary, strictly increasing across the returned chunks
      +//! even when the overlap walk collapses `actual_start` to a previous
      +//! chunk's `prev_min`. Unmodified `base_policy_hash` is stored in
      +//! `Chunk.policy_hash` so the field still answers "what policy was
      +//! active". v1.1 second-iteration patch — logged in
      +//! `tasks/HOTFIXES.md` (2026-05-27).
      
    • (b) HOTFIXES.md entry (the entry body from spec §4.4, as-is):
      ## 2026-05-27 — v0.20.0 sub-item 1: chunk_id `#c{char_start}` workaround collapses under aggressive overlap (Bug #3 second-iteration patch)
      
      **Symptom**: ingesting F2 (1580 chars OCR, scanned_page2.pdf) fails with
      `DocumentStore::put_chunks (pdf): sqlite error: UNIQUE constraint
      failed: chunks.chunk_id: ... Error code 1555: A PRIMARY KEY constraint
      failed`. Reproducible on every `--force-reingest` in the `kebab v0.20.0`
      (commit `b4d9e60`) dogfood (qwen2.5vl:3b on the `192.168.0.47:11434`
      Ollama endpoint, `/build/cache/tmp/v0.20-dogfood` isolated KB).

      **Root cause**: at `crates/kebab-chunk/src/pdf_page_v1.rs:170`, in
      `per_chunk_hash = format!("{base_policy_hash}#c{char_start}")`,
      `char_start` = post-overlap `actual_start`. The overlap walk at lines
      266-281 back-walks only as far as the `prev_min` floor, so with
      aggressive overlap + a page whose first segment is small (F2's Korean
      OCR text: a sentence end within the first ~10 chars → segment_1 =
      [0, 30], segment_2 = [30, n], overlap_bytes 240 / chars=80 →
      segment_2's actual_start collapses to prev_min=0), both chunks get an
      identical `#c0` suffix → identical chunk_id → `chunks` PRIMARY KEY
      violation.

      **Fix** (spec §4.4): add `segment_start` to the `chunk_page` return
      tuple (3-tuple → 4-tuple `(segment_start, actual_start, chunk_end,
      slice)`) and change the caller's `per_chunk_hash` suffix to
      `segment_start`. `segment_start` = `bounds[seg_idx]` (strictly
      increasing after dedup), so every chunk is distinct regardless of the
      overlap walk. `SourceSpan::Page.char_start`, which carries citation
      locality, still keeps the post-overlap `actual_start`.

      **chunker_version cascade**: bump `pdf-page-v1` → `pdf-page-v1.1`
      (spec §4.4.1 round 1c M-1 decision, a direct application of the
      design §9 cascade rule). chunk_ids of multi-chunk PDF pages change
      (including normal paths such as the pre-OCR `metro-korea.pdf` with
      21 blocks / 34 chunks), which gives an explicit user-facing audit
      trail and an automatic invalidation report from the store layer.
      Because v0.20.0 is on the force-update path, the user cost is zero
      (a fresh ingest happens anyway).
      
      **Amends**: spec `docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md`
      §4.4. parent design §4.2 chunk_id recipe 자체 unchanged (workaround
      layer 의 internal computation 만 변경). parent PR #189
      (`feat/pdf-scanned-ocr`, force-update path).
      
  • Acceptance:
    • grep -c "#c{segment_start}" crates/kebab-chunk/src/pdf_page_v1.rs ≥ 2 (module doc + the actual call at line 170).
    • grep -c "2026-05-27 — v0.20.0 sub-item 1: chunk_id" tasks/HOTFIXES.md = 1.

Sub-action B5 — unit regression test multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids

  • Files affected: crates/kebab-chunk/src/pdf_page_v1.rs, in #[cfg(test)] mod tests (currently verified: the make_pdf_doc(&[&str]) + default_policy(target, overlap) helpers already exist, lines 300-371).
  • Action (the test body from spec §4.5, as-is):
    #[test]
    fn multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids() {
        // Trigger shape of the Korean OCR text: 10 chars of "가" + ". " + 500 chars of "나".
        // → first segment [0, 12), second segment [12, n).
        //   page_text byte_len = 10*3 + 2 + 500*3 = 1532 > target_bytes=1500
        //   → multi-chunk. overlap_bytes = min(240, 750) = 240 chars=80
        //   → the second chunk's actual_start collapses to prev_min=0 → same `#c0`.
        //
        // default_policy(500, 80): target_tokens=500 → target_bytes=500*3=1500
        // (Korean at 3 bytes/char), overlap_tokens=80 → overlap_bytes=min(240, 750)=240.
        // Added per verifier round 1 L-3.
        let early_seg: String = std::iter::repeat('가').take(10).collect();
        let tail: String = std::iter::repeat('나').take(500).collect();
        let page_text = format!("{early_seg}. {tail}");
    
        let doc = make_pdf_doc(&[&page_text]);
        let policy = default_policy(500, 80);  // target=1500 byte, overlap=240 byte
        let chunks = PdfPageV1Chunker.chunk(&doc, &policy).unwrap();
    
        assert!(
            chunks.len() >= 2,
            "expected ≥2 chunks for {} byte page; got {}",
            page_text.len(),
            chunks.len()
        );
    
        let mut ids: Vec<&str> = chunks.iter().map(|c| c.chunk_id.0.as_str()).collect();
        ids.sort_unstable();
        let total = ids.len();
        ids.dedup();
        assert_eq!(
            ids.len(),
            total,
            "all chunk_ids must be unique even when overlap walks actual_start back to prev_min"
        );
    }
    
  • Acceptance:
    • cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids -j 4 green.
    • cargo test -p kebab-chunk deterministic_chunk_ids_1000 -j 4 green (existing determinism invariant preserved).
    • cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target -j 4 green (existing overlap clamp invariant preserved).
    • cargo test -p kebab-chunk -j 4: full suite green.
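The byte arithmetic in the B5 test comments can be verified independently of the chunker. In the sketch below, `budgets` encodes the conversion implied by those comments (3 bytes per token, overlap clamped to half the target); that formula is inferred from the comments and is an assumption about what `default_policy` and the chunker do, and both function names are hypothetical.

```rust
/// Rebuilds the trigger text from the B5 test: 10 x '가' + ". " + 500 x '나'.
fn trigger_page_text() -> String {
    let early_seg = "가".repeat(10);
    let tail = "나".repeat(500);
    format!("{early_seg}. {tail}")
}

/// Assumed token-to-byte conversion inferred from the test comments:
/// target_bytes = target_tokens * 3,
/// overlap_bytes = min(overlap_tokens * 3, target_bytes / 2).
fn budgets(target_tokens: usize, overlap_tokens: usize) -> (usize, usize) {
    let target_bytes = target_tokens * 3;
    let overlap_bytes = (overlap_tokens * 3).min(target_bytes / 2);
    (target_bytes, overlap_bytes)
}
```

Each Hangul syllable is 3 bytes in UTF-8, so the page is 10*3 + 2 + 500*3 = 1532 bytes, just over the 1500-byte target, which is what forces the multi-chunk path.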

Commit (Step 2 전체)

fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3)

- crates/kebab-chunk/src/pdf_page_v1.rs: chunk_page returns 4-tuple
  (segment_start, actual_start, chunk_end, slice); caller per_chunk_hash
  suffix uses segment_start (pre-overlap boundary, strictly increasing)
  instead of char_start (post-overlap, may collapse to prev_min).
- VERSION_LABEL "pdf-page-v1" → "pdf-page-v1.1" (design §9 cascade,
  explicit user-facing audit trail).
- module docs: workaround description updated to segment_start.
- mod tests: multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids
  regression pin.
- tasks/HOTFIXES.md: 2026-05-27 entry (symptom F2 1580 char OCR,
  intra-doc collision root cause, second-iteration patch rationale).
- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §4

Step 3 (Group C): Bug #3 multi-scanned PDF integration test

spec §4.5 + §4.5.1: a kebab-app integration-level chunk_id collision-free regression. It uses a MockOcrEngine for the OcrEngine trait to avoid a real Ollama. The existing private MockOcrEngine at crates/kebab-app/tests/pdf_ocr_apply.rs:20-45 lives in a separate test binary of the same crate and cannot be imported directly, so the executor's first sub-action is to decide the share path.

Sub-action C1 — MockOcrEngine share decision (the executor's dependency-check task)

  • Files affected (branches per option):
    • Option A (share via tests/common/), corrected per the actual probe results of verifier round 1 H-2:
      • crates/kebab-app/tests/common/mod.rs already exists (172 lines of TestEnv infrastructure: #![allow(dead_code)] + pub struct TestEnv + pub fn ingest_md + pub fn lexical_query, etc.). action = append 1 line, pub mod mock_ocr; (no new mod.rs).
      • crates/kebab-app/tests/common/mock_ocr.rs (new file: MockOcrEngine lift + per-page ctor).
      • Remove the MockOcrEngine struct + impl from the existing pdf_ocr_apply.rs:20-45, add the mod common; use common::mock_ocr::MockOcrEngine; import, and migrate the ctor call sites (see M-3).
      • The new integration test shares it via mod common; use common::mock_ocr::MockOcrEngine;.
    • Option B (inline duplication): an inline struct LocalMockOcr + impl OcrEngine for LocalMockOcr inside the new test file multi_scanned_pdf_ingest_no_chunk_id_collision.rs (prioritizes test isolation; common/mod.rs untouched).
  • Action:
    • (a) dependency probe, following the decision path of spec §4.5.1:
      grep -rn "impl OcrEngine" crates/kebab-parse-image/src/ crates/kebab-app/tests/
      # actual result:
      #   crates/kebab-parse-image/src/ocr.rs:235 (production OllamaVisionOcr).
      #   crates/kebab-app/tests/pdf_ocr_apply.rs:25 (test-only MockOcrEngine).
      ls crates/kebab-app/tests/common/mod.rs
      # actual result: -rw-r--r-- ... 172 lines (TestEnv infrastructure already exists).

    • (b) executor decision:
      • The existing MockOcrEngine is constructed as MockOcrEngine { expected_text: String, fail: bool }. Supporting a different text length per page requires extending the ctor signature (e.g. expected_text: Vec<String> + an internal Mutex<usize> cursor). The extension is trivial and both tests are in the same crate → Option A recommended.
      • With Option A, the MockOcrEngine ctor call sites in pdf_ocr_apply.rs (actual verifier probe = 10 instantiation sites at lines 140, 170, 193, 210, 242, 284, 311, 334, 359, 399, correcting the "9 → 10" off-by-1 from critic round 1 L-2; the struct definition at line 21 is excluded) migrate to the new ctor signature. For backward compatibility, provide two ctors (MockOcrEngine::single(text, fail) + MockOcrEngine::per_page(texts, fail)). Mechanical migration: at each site, MockOcrEngine { expected_text: <text>, fail: <bool> } → MockOcrEngine::single(<text>, <bool>) (10 sites × 1 line edit, the actual cost from verifier round 1 M-3).
      • Option B (inline) applies when the sharing cost exceeds the test isolation value. This plan's first preference = Option A.
    • (c) record the decision: write the decision path under §6 open question 1 in the closing summary of the result file (.omc/reviews/2026-05-27-v0.20-bugfix-plan-drafter-r1c-result.md). With Option A, the file edits of sub-action C2 = (existing) common/mod.rs append 1 line + (new) common/mock_ocr.rs + (modify) pdf_ocr_apply.rs + (new) multi_scanned_pdf_ingest_no_chunk_id_collision.rs = 4 files. With Option B, only 1 new file.
  • Acceptance:
    • probe grep returns ≥ 2 lines (production + existing mock).
    • probe ls confirms that common/mod.rs exists.
    • The executor's decision is stated in §6 open question OQ-1 of the plan.
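The two-ctor shape proposed for Option A could look roughly like the sketch below. It is illustrative only: the `OcrEngine` trait here is a local stand-in with a made-up `recognize` signature (the real trait in kebab-parse-image may differ), and the fallback of repeating the last text once the list is exhausted is a design assumption, not something the plan specifies.

```rust
use std::sync::Mutex;

/// Local stand-in for the real `OcrEngine` trait; the method name and
/// signature are assumptions for illustration.
trait OcrEngine {
    fn recognize(&self, image: &[u8]) -> Result<String, String>;
}

/// Mock with per-page texts and an internal cursor, as suggested in (b).
struct MockOcrEngine {
    texts: Vec<String>,
    cursor: Mutex<usize>,
    fail: bool,
}

impl MockOcrEngine {
    /// Backward-compatible ctor: one text served for every call.
    fn single(text: &str, fail: bool) -> Self {
        Self::per_page(vec![text.to_string()], fail)
    }

    /// New ctor: one text per successive call (page).
    fn per_page(texts: Vec<String>, fail: bool) -> Self {
        Self { texts, cursor: Mutex::new(0), fail }
    }
}

impl OcrEngine for MockOcrEngine {
    fn recognize(&self, _image: &[u8]) -> Result<String, String> {
        if self.fail {
            return Err("mock OCR failure".to_string());
        }
        let mut cursor = self.cursor.lock().unwrap();
        // Assumed behaviour: once the list is exhausted, keep returning the
        // last text so that `single` works for any number of calls.
        let idx = (*cursor).min(self.texts.len().saturating_sub(1));
        *cursor += 1;
        self.texts
            .get(idx)
            .cloned()
            .ok_or_else(|| "no mock text configured".to_string())
    }
}
```

Under this shape, each existing struct-literal site becomes a one-line `MockOcrEngine::single(text, fail)` call, and the new integration test uses `per_page` to hand out a different text for each scanned PDF.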

Sub-action C2 — integration test 작성 (conditional on C1 결정)

  • Files affected (Option A 채택 가정, verifier round 1 H-2 정정):
    • crates/kebab-app/tests/common/mod.rs (existing 172 line — pub mod mock_ocr; 1줄 append 만).
    • crates/kebab-app/tests/common/mock_ocr.rs (new: MockOcrEngine lift + per-page ctor).
    • crates/kebab-app/tests/pdf_ocr_apply.rs:20-45 (remove the existing inline impl and add mod common; use common::mock_ocr::MockOcrEngine;, i.e. one mod declaration line at the file head) + mechanical migration of the 10 ctor call sites (M-3).
    • crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs (new): imports via mod common; use common::mock_ocr::MockOcrEngine;.
  • Action (the test body from spec §4.5, expanded within this sub-action):
    • fixture: F1 (scanned_page1.pdf, 779 char OCR) + F2 (scanned_page2.pdf, 1580 char OCR) + 1 synthetic small-page PDF (300 char), for 3 scanned PDFs in total.
    • MockOcrEngine ctor: per-page text vec ["text for F1", "1580-char string for F2", "text for the synthetic 300-char PDF"] + fail: false.
    • isolated KB: tempfile::tempdir() + Config::default() with only data_dir overridden + workspace [ingest.pdf].enabled = true.
    • assertion path:
      1. Call kebab_app::ingest_with_config_opts(&cfg, ...) (facade).
      2. report.items.iter().filter(|i| i.kind == IngestItemKind::Error).count() == 0: no ErrorKind::Storage row, which a chunk_id collision would produce.
      3. store.get_chunks_count() == sum(per-PDF chunk_counts): final row count of the DELETE+INSERT path.
      4. store.get_all_chunk_ids().iter().collect::<HashSet<_>>().len() == chunks_count: chunk_id global uniqueness.
    • executor degradation path (spec §4.5.1 conditional downgrade): if sharing via Option A is too costly or risky and Option B is also impractical (e.g. the ExtractContext / Facade wiring of the integration setup exceeds this sub-action's scope), conditionally downgrade the §6 row 7 acceptance. The kebab-chunk unit-level invariant (Step 2 B5) alone then pins the core Bug #3 regression, and the integration test is skipped.
  • Acceptance:
    • cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision -j 4 green.
    • cargo test -p kebab-app pdf_ocr_apply -j 4 green (zero regressions in existing tests after the 10 ctor sites that use MockOcrEngine { expected_text, fail } literal struct construction migrate to MockOcrEngine::single(text, fail); 10 is the actual count per critic round 1 L-2).
    • On the downgrade path: record one line in the result file and commit body: "§6 row 7 conditional skip: Bug #3 core regression = kebab-chunk unit B5".
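
The global-uniqueness assertion in step 4 of the assertion path can be sketched independently of the store. This is a minimal, std-only sketch: duplicate_chunk_ids and all_chunk_ids_unique are hypothetical helper names (not part of kebab-app), and the input is assumed to be whatever store.get_all_chunk_ids() returns, converted to strings.

```rust
use std::collections::HashSet;

/// Returns every chunk ID that occurs more than once, each reported a single
/// time, in the order in which the first repeat is encountered.
/// (Hypothetical helper for illustration; not an existing kebab API.)
fn duplicate_chunk_ids(ids: &[String]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut reported: HashSet<&str> = HashSet::new();
    let mut duplicates = Vec::new();
    for id in ids {
        // `insert` returns false when the value was already present.
        if !seen.insert(id.as_str()) && reported.insert(id.as_str()) {
            duplicates.push(id.clone());
        }
    }
    duplicates
}

/// True when no chunk ID repeats, i.e. the set size equals the row count.
fn all_chunk_ids_unique(ids: &[String]) -> bool {
    duplicate_chunk_ids(ids).is_empty()
}
```

Compared with a bare HashSet length comparison, returning the offending IDs lets the test failure message name the colliding chunk_id rather than only report a count mismatch.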

Commit (all of Step 3)

test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression)

- crates/kebab-app/tests/common/{mod,mock_ocr}.rs: MockOcrEngine lift
  with per-page text ctor (shared by pdf_ocr_apply.rs + new test).
- crates/kebab-app/tests/multi_scanned_pdf_ingest_no_chunk_id_collision.rs:
  3 scanned PDF (F1 + F2 + synthetic 300char) ingest via mock OCR,
  assert all chunk_ids globally unique + zero ErrorKind::Storage rows.
- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §4.5

If Option B (inline) or the conditional downgrade is adopted, adjust the commit body and file list accordingly.


Step 4 (Group D): Bug #4 F4 fixture re-generation

spec §5: replace the byte-level re.sub + manual startxref edit in tests/fixtures/_synth/mojibake.py with proper PDF surgery via pikepdf (open + delete /ToUnicode + save, with automatic xref regen). Regenerate the F4 fixture itself (crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf) and add 3 new invariant tests.

Sub-action D1 — tests/fixtures/_synth/mojibake.py pikepdf rewrite

  • Files affected: tests/fixtures/_synth/mojibake.py (full rewrite; the existing byte-edit pattern is dropped).
  • Action (per the body of spec §5.4):
    • Step 1: synthesize a Korean PDF with reportlab using a Type 0 (CID) font (including a valid ToUnicode CMap).
    • Step 2: open with pikepdf.open(path, allow_overwriting_input=True) + remove the /ToUnicode entry from every dictionary + pdf.save(path) (xref regenerated automatically). Note that allow_overwriting_input is an argument of pikepdf.open(), not of save().
    • Step 3: verify the invariants len(pdf.pages) == 1 + b"/ToUnicode" not in dst.read_bytes().
    • On failure: non-zero exit code + stderr message (removed count = 0 in Step 2 → exit 2; page count mismatch in Step 3 → exit 3; ToUnicode still present → exit 4).
  • Dep install (executor pre-action):
    pip install --cache-dir /build/cache/pip pikepdf reportlab
    python -c "import pikepdf; import reportlab; print(pikepdf.__version__, reportlab.Version)"
    # font availability probe (critic round 1 L-3): this path is hardcoded in mojibake.py.
    test -f /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf \
        || sudo apt-get install -y fonts-dejavu-core
    
    Not reflected in CI: the fixture itself is committed, so generation is a one-off (local to the executor in Step 4 D2). Optionally add a one-line pikepdf install hint to tasks/HOTFIXES.md.
  • Acceptance:
    • grep -c "import pikepdf" tests/fixtures/_synth/mojibake.py = 1.
    • grep -c "re.sub" tests/fixtures/_synth/mojibake.py = 0 (confirms the byte-edit pattern is gone).
    • test -f /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf exit 0 (font probe, the fast failover signal from critic round 1 L-3).
    • python tests/fixtures/_synth/mojibake.py /tmp/mojibake_dryrun.pdf && echo OK exits 0 with empty stderr.

Sub-action D2 — F4 fixture binary regenerate + snapshot regen + commit

  • Files affected (the actual probe grep -rn 'fixtures/mojibake.pdf' crates/ from verifier round 1 H-4 + critic round 1 M-2 enumerates 2 consumers):
    • crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf (regenerate).
    • crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json (the snapshot baseline file itself: delete + auto-regen). Per the actual probe in verifier round 1 H-4, text_extractor_regression.rs:59-64 uses a hand-rolled unwrap_or_else { write baseline } pattern.
    • crates/kebab-parse-pdf/tests/text_extractor_regression.rs (existing test: zero code changes, it only triggers the snapshot regen path).
    • crates/kebab-parse-pdf/src/text_quality.rs:96 (the 2nd consumer from verifier round 1 H-4: the unit test/doctest around let bytes = include_bytes!("../tests/fixtures/mojibake.pdf"); is verified at the same time when the fixture binary changes; zero code changes).
  • Action:
    • (a) regenerate command:
      python tests/fixtures/_synth/mojibake.py \
          crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
      
    • (b) manual probe after regeneration:
      python -c "import pikepdf; pdf = pikepdf.open('crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf'); print(len(pdf.pages))"
      # expected: 1
      grep -c "/ToUnicode" crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf
      # expected: 0 (binary grep: no ToUnicode in the Pages dict)
      
    • (c) snapshot baseline regen (the actual mechanic per verifier round 1 H-4 + critic round 1 M-2; closes OQ-2):
      • text_extractor_regression.rs:59-64: let baseline = std::fs::read_to_string("tests/snapshots/vector_pdf_canonical.json").unwrap_or_else(|_| { std::fs::write(baseline_path, &actual).expect(...); actual.clone() }), a hand-rolled pattern (the insta crate is not used).
      • fixture binary changes → on the next cargo test, actual != baseline → assert_eq! fails.
      • executor regen step:
        rm crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json
        cargo test -p kebab-parse-pdf vector_pdf_extract_byte_identical_to_baseline -j 4
        # 1st run: snapshot file is absent → the unwrap_or_else write pattern creates a new baseline → assert passes.
        cargo test -p kebab-parse-pdf vector_pdf_extract_byte_identical_to_baseline -j 4
        # 2nd run: byte-identical to the new baseline → assert passes (regression invariant established).

      • OQ-2 closure: the insta crate is not used and the cargo-insta CLI is not needed. This documents the actual mechanic behind spec §5.6's "update the F4 baseline of the existing text_extractor_regression.rs".
  • Acceptance:
    • stat crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf reports size > 0.
    • grep -c "/ToUnicode" crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf = 0.
    • the Python page count probe prints 1.
    • test -f crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json exit 0 (after the snapshot regen).
    • cargo test -p kebab-parse-pdf vector_pdf_extract_byte_identical_to_baseline -j 4 green twice in a row (regen + verify, the H-4 mechanic).
    • cargo test -p kebab-parse-pdf -j 4 fully green: this also verifies the new D3 tests and the 2nd consumer at text_quality.rs:96.
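
The delete-then-run-twice procedure in (c) works because of how a read-or-write baseline behaves. The following std-only sketch mirrors the pattern quoted from text_extractor_regression.rs:59-64. The function name load_or_write_baseline and the directory-creation step are assumptions added for illustration, not code from the repository.

```rust
use std::fs;
use std::path::Path;

/// Reads the baseline at `path`. If reading fails (typically because the file
/// was deleted), writes `actual` as the new baseline and returns it.
/// (Hypothetical helper; the real test inlines this logic.)
fn load_or_write_baseline(path: &Path, actual: &str) -> String {
    fs::read_to_string(path).unwrap_or_else(|_| {
        // Assumption for this sketch: create the snapshot directory if needed.
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent).expect("create snapshot directory");
        }
        fs::write(path, actual).expect("write new baseline");
        actual.to_owned()
    })
}
```

On the 1st run after deletion the returned baseline equals the actual output, so the comparison passes trivially. Only the 2nd run compares against a baseline read back from disk, which is why two green runs are required. Note that any read error, not just a missing file, triggers the rewrite in this pattern.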

Sub-action D3 — 3 new invariant tests for parse-pdf

  • Files affected (decided from the actual probe in verifier round 1 H-3):
    • actual ls crates/kebab-parse-pdf/tests/ output: common, extractor.rs, fixtures, ocr_e2e.rs, page_image.rs, snapshots, text_extractor_regression.rs. The primary candidate from plan round 0, text_extractor.rs, does not exist.
    • Decision: append to crates/kebab-parse-pdf/tests/text_extractor_regression.rs (locality with the F4 fixture consumer, and the same file as the D2 snapshot regen mechanic). Append 3 new #[test] fns.
    • Alternative (if the executor decides to split for file size / cohesion reasons): a new crates/kebab-parse-pdf/tests/mojibake_invariants.rs. plan first preference = append to text_extractor_regression.rs.
  • Action (path corrections to the 3 test bodies of spec §5.5, per verifier round 1 H-3):
    1. mojibake_fixture_load_yields_one_page: let bytes = include_bytes!("fixtures/mojibake.pdf"); (integration tests already sit at the crates/kebab-parse-pdf/tests/ root, so this follows the canonical pattern at text_extractor_regression.rs:42; the "../tests/fixtures/mojibake.pdf" in spec §5.5 is wrong and should be "fixtures/mojibake.pdf" directly). lopdf::Document::load_mem(bytes).unwrap().get_pages().len() == 1.
    2. mojibake_fixture_has_no_tounicode_cmap: avoids the risk of a CWD-relative std::fs::read("tests/fixtures/mojibake.pdf") (CARGO_MANIFEST_DIR ≠ CWD is possible under cargo test) by using let bytes = include_bytes!("fixtures/mojibake.pdf");. bytes.windows(b"/ToUnicode".len()).filter(|w| *w == b"/ToUnicode").count() == 0.
    3. pdf_text_extractor_on_mojibake_yields_one_block: let bytes = include_bytes!("fixtures/mojibake.pdf"); + verifies PdfTextExtractor's invariant of 1 Block::Paragraph per page: canonical.blocks.len() == 1, plus a scanned-candidate warning or non-empty text. For the actual body of the ExtractContext setup, the executor uses the existing helper in text_extractor_regression.rs (if any) or expands the placeholder in spec §5.5.
  • Acceptance:
    • cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page -j 4 green.
    • cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap -j 4 green.
    • cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block -j 4 green.
    • cargo test -p kebab-parse-pdf -j 4 fully green.
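
The byte-window scan used by test 2 can be factored into a small function. This is a minimal std-only sketch: count_marker is a hypothetical name, and the empty-marker guard is an addition for safety because slice::windows panics on a window size of 0.

```rust
/// Counts occurrences (including overlapping ones) of `marker` in `haystack`.
/// (Hypothetical helper for illustration; not part of kebab-parse-pdf.)
fn count_marker(haystack: &[u8], marker: &[u8]) -> usize {
    // `windows(0)` panics, so treat an empty marker as "no occurrences".
    if marker.is_empty() {
        return 0;
    }
    haystack
        .windows(marker.len())
        .filter(|window| *window == marker)
        .count()
}
```

A raw byte scan only sees markers stored uncompressed in the file. Whether the regenerated fixture keeps its dictionaries outside compressed object streams is not stated in this plan, so the scan complements the pikepdf-side check in D1 Step 3 rather than replacing it.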

Commit (all of Step 4)

fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4)

- tests/fixtures/_synth/mojibake.py: full rewrite — replace byte-level
  re.sub + manual startxref edit with pikepdf open+del+save (auto xref
  regen). Type 0 font + ToUnicode strip via dictionary walk.
- crates/kebab-parse-pdf/tests/fixtures/mojibake.pdf: regenerate.
- crates/kebab-parse-pdf/tests/text_extractor_regression.rs: append 3
  invariant tests (lopdf 1-page / no ToUnicode marker / PdfTextExtractor
  1-block); file path decided per verifier round 1 H-3 (same locality
  as snapshot regen).
- crates/kebab-parse-pdf/tests/snapshots/vector_pdf_canonical.json:
  delete + auto-regen via 2-run cargo test (hand-rolled unwrap_or_else
  pattern, verifier round 1 H-4).
- spec: docs/superpowers/specs/2026-05-27-v0.20-sub1-bugfix-spec.md §5

verifier round 1 NIT-2 fix: commit scope test-fixture → parse-pdf (crate name, the typical conventional-commit scope).


Step 5 (Group E): Workspace verify + commit + PR #189 force-push

The 16-row consolidated acceptance of spec §6 serves as this step's verifier checklist. Every acceptance command is scriptable.

Sub-action E1 — cargo workspace test -j 1

  • Files affected: none (verification only).
  • Action:
    • (a) conditional cargo clean, per memory feedback_cargo_clean_policy (verifier round 1 NIT-1 fix: force GB units with -BG to avoid TB/GB unit confusion):
      # Read /build avail directly as an integer in GB (strip only the 'G' suffix from the df -BG output).
      AVAIL_GB=$(df -BG --output=avail /build | tail -1 | tr -d ' G')
      # Read the size of CARGO_TARGET_DIR as an integer in GB as well (du -BG output).
      TARGET_GB=$(du -BG -s "${CARGO_TARGET_DIR:-target}" 2>/dev/null | awk '{print $1}' | tr -d 'G')
      # /build avail < 500 GB OR target > 200 GB → clean
      if [[ "${AVAIL_GB:-9999}" -lt 500 ]] || [[ "${TARGET_GB:-0}" -gt 200 ]]; then
          cargo clean
      fi
      
      If neither threshold is hit, skip and record one line in the commit body / result file (e.g. "skipped cargo clean: /build avail ${AVAIL_GB}G, target ${TARGET_GB}G").
    • (b) workspace test:
      set -o pipefail  # otherwise the pipeline reports tail's exit code, not cargo's
      cargo test --workspace --no-fail-fast -j 1 2>&1 | tail -100

    • Check the last 100 lines + the final summary "test result: ok. N passed; 0 failed".
  • Acceptance:
    • exit code 0.
    • stdout contains "test result: ok" + "0 failed".
    • satisfies spec §6 row 14 (workspace full test pass).
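
The conditional-clean rule in (a) reduces to a small predicate. This is a minimal std-only sketch: parse_gb and should_cargo_clean are hypothetical names. The thresholds (500 GB available, 200 GB target size) and the defaults for unreadable values (9999 and 0) are taken from the shell snippet above.

```rust
/// Parses a `df -BG` / `du -BG` style field such as " 812G" into whole GB.
/// Returns None when the field is not an integer with an optional 'G' suffix.
fn parse_gb(field: &str) -> Option<u64> {
    field.trim().trim_end_matches('G').parse().ok()
}

/// Mirrors the shell condition: clean when /build has < 500 GB available
/// OR the cargo target directory exceeds 200 GB. Missing values fall back to
/// the same defaults as the shell snippet (9999 and 0), i.e. "do not clean".
fn should_cargo_clean(avail_gb: Option<u64>, target_gb: Option<u64>) -> bool {
    let avail = avail_gb.unwrap_or(9999);
    let target = target_gb.unwrap_or(0);
    avail < 500 || target > 200
}
```

Both comparisons are strict, so exactly 500 GB available and exactly 200 GB of target output do not trigger a clean.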

Sub-action E2 — cargo clippy --workspace -- -D warnings

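  • Files affected: none.
  • Action:
    set -o pipefail  # otherwise the pipeline reports tail's exit code, not cargo's
    cargo clippy --workspace --all-targets -j 1 -- -D warnings 2>&1 | tail -50

  • Acceptance:
    • exit code 0.
    • zero occurrences of the "warning" keyword (-D warnings turns any warning into an error anyway).
    • satisfies spec §6 row 13 (workspace clippy clean).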

Sub-action E3 — dogfood re-run (Ollama qwen2.5vl:3b environment)

  • Files affected: none. /build/cache/tmp/v0.20-dogfood/ (isolated KB; the same dogfood environment is reused).
  • Action (following memory feedback_pr_workflow + the _external/ invariant):
    • (a) release build:

      cargo build --release -p kebab-cli -j 4 2>&1 | tail -10
      "${CARGO_TARGET_DIR:-target}/release/kebab" --version
      # expected: kebab 0.20.0
      
    • (b) clean the dogfood KB + re-ingest 9 PDFs (same dogfood environment as spec §1.1).

      canonical config path = /build/cache/tmp/v0.20-dogfood/config.toml (assumed in §0). The external backup file /build/cache/tmp/v0.20-dogfood-config.toml does not exist, per the actual probe in critic round 1 H-1. So back up the config itself, then clean and restore (this keeps the destructive rm -rf from deleting the config along with the KB):

      # Step A: temporary backup of the config (preserved before the KB clean).
      cp /build/cache/tmp/v0.20-dogfood/config.toml \
          /build/cache/tmp/v0.20-dogfood-config.toml.bak

      # Step B: clean the whole KB (destructive, config included; the backup preserves it).
      rm -rf /build/cache/tmp/v0.20-dogfood/
      mkdir -p /build/cache/tmp/v0.20-dogfood/

      # Step C: restore the config from the backup.
      cp /build/cache/tmp/v0.20-dogfood-config.toml.bak \
          /build/cache/tmp/v0.20-dogfood/config.toml

      # config.toml contains [ingest.pdf].enabled = true, ollama endpoint =
      # http://192.168.0.47:11434, ocr_model = qwen2.5vl:3b

      # Step D: ingest.
      # RELEASE_BIN defaults to the binary built in (a) unless it is already set.
      : "${RELEASE_BIN:=${CARGO_TARGET_DIR:-target}/release/kebab}"
      "$RELEASE_BIN" ingest --config /build/cache/tmp/v0.20-dogfood/config.toml \
          --json --force-reingest 2>&1 | tee /build/cache/tmp/v0.20-dogfood-ingest.ndjson

      # Step E (optional): clean up the backup file so that redundant backups do not
      # accumulate across dogfood iterations. The canonical config lives in place at
      # v0.20-dogfood/config.toml; the .bak is transient.
      rm /build/cache/tmp/v0.20-dogfood-config.toml.bak
      

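      (Alternative: a single-step selective delete that preserves the config and removes everything else. -maxdepth 1 restricts the match to top-level entries so find does not descend into directories it has just removed):

      find /build/cache/tmp/v0.20-dogfood/ -mindepth 1 -maxdepth 1 -not -name 'config.toml' \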
          -exec rm -rf {} +
      

      The actual procedure favors the clarity of the plain 5-step path (the executor's default).

    • (c) acceptance:

      • spec §6 row 3: skipped_size_exceeded == 0 for non-code files across the 9 PDFs (i.e. all 0, since the workspace contains no code files).
      • spec §6 row 8: kind != "Error" for F1 + F2 (no chunk_id collision).
      • spec §6 row 12: the mojibake.pdf ingest item has block_count: 1.
      • spec §6 row 15: all 9 PDFs are ingested, errors = 2 (encrypted only, same as the pre-existing dogfood baseline).
    • Fallback when Ollama is unavailable: if the endpoint is unreachable, this sub-action may be partially skipped. Unit/integration-level evidence from the workspace test (E1) + clippy (E2) satisfies spec §6 rows 1, 2, 4-7, 9-11, 13-14, 16, and the skip of dogfood rows 3, 8, 12, 15 is recorded in one line (commit body + result file).

  • Acceptance:
    • errors = 2 (encrypted only) in the ingest report ndjson.
    • the kind field of each F1/F2/mojibake item line is on the success path (= "new" or "unchanged", not "Error").
    • dogfood log path: /build/cache/tmp/v0.20-dogfood-ingest.ndjson (referenced in the commit body).

Sub-action E4 — commit check + final organize

  • Files affected: the cumulative changes of all steps.
  • Action:
    • git status + git log --oneline b4d9e60..HEAD: 4 commits from Steps 1-4 + 0 commits from Step 5 (verification only, no commit).
    • If any work-in-progress files remain, reset them.
    • Check the Co-Authored-By: line of each commit message (CLAUDE.md gitea-pr workflow).
  • Acceptance:
    • git log --oneline b4d9e60..HEAD | wc -l = 4 (1 commit each for Steps 1-4).
    • git status shows 0 untracked + modified files.

Sub-action E5 — PR #189 force-push

  • Files affected: remote ref gitea/feat/pdf-scanned-ocr.
  • Action (can be invoked directly via the gitea-ops skill):
    git push gitea feat/pdf-scanned-ocr --force-with-lease
    
    • --force-with-lease: force-pushes only when the local fetch state matches the remote HEAD (protects pushes by other collaborators; a cheap safety in this single-user environment).
    • Update the PR #189 body: add the Bug #2/#3/#4 fix summary + dogfood evidence (gitea API PATCH /repos/altair823-org/kebab/pulls/189).
    • Per user memory feedback_pr_workflow, enter the review loop of the gitea-pr-review skill (multi-round critic + verifier).
  • Acceptance:
    • head SHA reported by the gh-equivalent (gitea-ops gitea-pr-status 189) = local git rev-parse HEAD.
    • PR #189 commit count = commit count at the previous force-push + 4.
    • final state matches the 5-row sequencing summary table (§7: 4 commits + 1 verify-only step).

Commit (Step 5)

verification only: 0 git commits. The 4 commits of Steps 1-4 form the final tree.


§4 Verifier checklist (spec §6 16-row 1:1 mapping)

Each row is a scriptable command. All of them can be run cumulatively through Step 5 E1-E3.

| # | Verifier | Bug | Step | Command |
|---|---|---|---|---|
| 1 | walker bypasses size cap for PDF | #2 | A3 | cargo test -p kebab-source-fs size_cap_skips_only_code_files -j 4 |
| 2 | walker still skips oversized code files | #2 | A3 | cargo test -p kebab-source-fs ingest_report_counts_oversized_files_by_bytes -j 4 |
| 3 | 256KB+ PDF/markdown ingest default config | #2 | E3 | dogfood: skipped_size_exceeded = 0 for non-code in the ingest report of $RELEASE_BIN ingest ... |
| 4 | chunker collision regression test | #3 | B5 | cargo test -p kebab-chunk multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids -j 4 |
| 5 | chunker determinism preserved | #3 | B5 | cargo test -p kebab-chunk deterministic_chunk_ids_1000 -j 4 |
| 6 | chunker overlap clamp preserved | #3 | B5 | cargo test -p kebab-chunk overlap_clamped_when_overlap_exceeds_target -j 4 |
| 7 | integration: multi-scanned PDF ingest (conditional, §4.5.1) | #3 | C2 | cargo test -p kebab-app multi_scanned_pdf_ingest_no_chunk_id_collision -j 4 (skip + record on the Option A/B downgrade path) |
| 8 | dogfood: F1 + F2 force-reingest errors=0 | #3 | E3 | dogfood: errors = 0 (excluding encrypted) for $RELEASE_BIN ingest --force-reingest ... |
| 9 | F4 fixture lopdf 1-page invariant | #4 | D3 | cargo test -p kebab-parse-pdf mojibake_fixture_load_yields_one_page -j 4 |
| 10 | F4 fixture ToUnicode-absent invariant | #4 | D3 | cargo test -p kebab-parse-pdf mojibake_fixture_has_no_tounicode_cmap -j 4 |
| 11 | F4 PdfTextExtractor 1-block invariant | #4 | D3 | cargo test -p kebab-parse-pdf pdf_text_extractor_on_mojibake_yields_one_block -j 4 |
| 12 | dogfood: F4 ingest block_count=1 | #4 | E3 | dogfood: the mojibake.pdf ingest item has block_count: 1 |
| 13 | workspace clippy clean | all | E2 | cargo clippy --workspace --all-targets -j 1 -- -D warnings |
| 14 | workspace full test pass | all | E1 | cargo test --workspace --no-fail-fast -j 1 |
| 15 | dogfood end-to-end 9 PDF | all | E3 | dogfood: all 9 PDFs ingested, errors = 2 (encrypted only) |
| 16 | chunker_version cascade final value | #3 | B3 | grep -nE 'pdf-page-v[0-9.]+' crates/kebab-chunk/src/pdf_page_v1.rs yields "pdf-page-v1.1" |

In the final step (E1-E3), the executor runs all 16 rows as scripts and records pass/fail/skip row by row in the result file.

Workspace baseline expected test count delta (verifier round 1 M-1 closure)

Arithmetic for the expected delta of test result: ok. N passed from cargo test --workspace -j 1 (Step 5 E1), relative to the pre-fix baseline:

| Step | Sub-action | New test name | Crate | Type |
|---|---|---|---|---|
| 1 | A3 | size_cap_skips_only_code_files | kebab-source-fs | unit (in mod tests) |
| 2 | B5 | multi_chunk_page_with_aggressive_overlap_produces_unique_chunk_ids | kebab-chunk | unit (in mod tests) |
| 3 | C2 | multi_scanned_pdf_ingest_no_chunk_id_collision | kebab-app | integration (new test binary) |
| 4 | D3 | mojibake_fixture_load_yields_one_page | kebab-parse-pdf | unit-style integration |
| 4 | D3 | mojibake_fixture_has_no_tounicode_cmap | kebab-parse-pdf | unit-style integration |
| 4 | D3 | pdf_text_extractor_on_mojibake_yields_one_block | kebab-parse-pdf | unit-style integration |

  • Option A (full path, C2 active): total = +6 unit/integration test cases + 1 new integration test binary.
  • Option B (C2 conditional downgrade per §4.5.1): total = +5 test cases + 0 new binary.

Other: the assertion content of the existing B3 test chunker_version_is_pdf_page_v1 is unchanged (it references the VERSION_LABEL const) → test count delta 0. For the existing D2 test vector_pdf_extract_byte_identical_to_baseline only the assertion outcome changes (fixture change → baseline change) → test count delta 0; only the snapshot regen action is added.

When comparing N in the E1 acceptance, the executor confirms that it matches this delta arithmetic (so that a regression is detected).
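
The delta check can be written down as plain arithmetic. In this minimal sketch, baseline_passed stands for the pre-fix N, which this plan does not state, so the caller has to supply it. The function names are hypothetical.

```rust
/// Test cases always added by Steps 1, 2 and 4 (A3 + B5 + the three D3 tests).
const ALWAYS_ADDED: usize = 5;
/// The C2 integration test, present only when Step 3 is not downgraded.
const C2_ADDED: usize = 1;

/// Expected "N passed" after the fix, given the pre-fix baseline count.
fn expected_passed(baseline_passed: usize, c2_active: bool) -> usize {
    baseline_passed + ALWAYS_ADDED + if c2_active { C2_ADDED } else { 0 }
}

/// True when the observed count matches the plan's delta arithmetic.
fn delta_matches(baseline_passed: usize, observed_passed: usize, c2_active: bool) -> bool {
    observed_passed == expected_passed(baseline_passed, c2_active)
}
```

Because cargo test --workspace prints one summary line per test binary, the observed count is assumed to be the sum of "passed" across all summary lines.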


§5 Risks (plan stage)

  • R-1 (MockOcrEngine sharing complexity): Option A of spec §4.5.1 (tests/common/mock_ocr.rs lift) requires migrating the ctor at the 10 instantiation sites in the existing pdf_ocr_apply.rs (inline impl at :20-45; count per critic round 1 L-2). This is trivial if 2 backward-compat ctors (single + per_page) are introduced; on failure, downgrade to Option B (inline). spec §6 row 7 may be conditionally skipped.
  • R-2 (chunker_version bump cascade scope): the impact of pdf-page-v1.1 = changed chunk_ids for multi-chunk PDF pages. parser_version / embedding_version / prompt_template_version / index_version are unchanged; only the chunker_version field of the 5-version snapshot in kebab-eval::eval_runs.config_snapshot_json takes the new value. The cascade rule invariant of parent design §9 is preserved, and re-running the eval baseline is recommended (user-facing note in spec §7.1 Risk 1).
  • R-3 (F4 fixture binary churn): pikepdf's save output has a different PDF object ordering from reportlab+byte-edit → SHA256 changes + git binary diff noise. The text_extractor_regression.rs baseline is also updated in the same commit to the new fixture's actual output, handled together in Step 4 D2.
  • R-4 (dogfood depends on Ollama): the dogfood acceptance of spec §6 rows 3 + 8 + 12 + 15 calls the real 192.168.0.47:11434 qwen2.5vl:3b. If the endpoint is unavailable, close partially via unit/integration evidence (rows 1-2, 4-7, 9-11, 13-14, 16) + a skip record in the commit body / result file.
  • R-5 (pikepdf dependency install): import pikepdf in Step 4 D1's mojibake.py requires a pip install into this machine's Python venv. No CI dependency arises (generation is a one-off, after which the fixture is committed).
  • R-6 (confirming zero conflict with concurrent work on the parent plan): Step 11 (final verify + PR open) of the parent plan (2026-05-27-pdf-scanned-ocr-plan.md round 1c ACCEPT) is already closed at commit b4d9e60. This plan's fix commits stack on top of that commit, so there is no branch ordering conflict.

§6 Open questions deferred to executor

  • OQ-1 (MockOcrEngine sharing path): choose between Option A (tests/common/mock_ocr.rs lift) and Option B (inline) of spec §4.5.1. This is the executor's first action in Step 3 C1: decide after probing with grep -rn "impl OcrEngine" and record the decision in the result file. plan first preference = Option A.
  • OQ-2 (F4 baseline snapshot update tool): CLOSED (round 1c, critic M-2 + verifier H-4). The actual pattern at text_extractor_regression.rs:59-64 = hand-rolled unwrap_or_else { write baseline } (insta crate not used). regen procedure = delete the snapshot file tests/snapshots/vector_pdf_canonical.json + run cargo test twice (1st auto-regen, 2nd verify). cargo-insta CLI not needed. detail = Action (c) of §3 Step 4 D2.
  • OQ-3 (pikepdf install command): choice of cache-dir + venv for pip install: global --user pip, a venv dedicated to fixture generation, or a conda environment. plan default = pip install --cache-dir /build/cache/pip pikepdf reportlab (memory feedback_disk_layout).
  • OQ-4 (when the dogfood config.toml endpoint changes): if the 192.168.0.47:11434 Ollama endpoint of this dogfood environment changes, the executor overrides it with an alternative endpoint (e.g. localhost:11434) and records this in the result file.
  • OQ-5 (number of rounds in the PR #189 review loop): the gitea-pr + review loop of memory feedback_pr_workflow may enter round 2/2c depending on the results of the round 1 critic + verifier. This plan is round 0 (drafter); the outcome of the review rounds is outside the plan's scope.

§7 Sequencing summary (logical commit boundaries)

| Commit # | Step range | Logical scope | File count |
|---|---|---|---|
| 1 | Step 1 (A1+A2+A3) | fix(source-fs): apply size limit only to code files; PDF/image/markdown bypass walker cap (Bug #2) | 2 |
| 2 | Step 2 (B1+B2+B3+B4+B5) | fix(chunk): chunk_id collision under aggressive overlap; bump pdf-page-v1 → pdf-page-v1.1 (Bug #3) | 4 (pdf_page_v1.rs + HOTFIXES.md + pdf_pipeline.rs:168 + :368, verifier H-1) |
| 3 | Step 3 (C1+C2) | test(app): multi-scanned PDF chunk_id collision-free integration test (Bug #3 regression) | 4 (Option A: existing common/mod.rs append + new common/mock_ocr.rs + modify pdf_ocr_apply.rs + new multi_scanned_pdf_ingest_no_chunk_id_collision.rs, verifier H-2) / 1 (Option B) |
| 4 | Step 4 (D1+D2+D3) | fix(parse-pdf): F4 mojibake.pdf via pikepdf surgery; preserve 1-page invariant (Bug #4) | 5 (mojibake.py + fixtures/mojibake.pdf + snapshots/vector_pdf_canonical.json + text_extractor_regression.rs (D3 append) + src/text_quality.rs:96 consumer verify, verifier H-4 + H-3 + NIT-2) |
| 5 | Step 5 (E1-E5) | verification only: 0 git commits; final state = PR #189 force-push on top of commits 1-4 | 0 |

4 commits + 1 verify-only step in total. After the force-push, the PR #189 head = local HEAD.


§8 Round 1c rewrite changelog (drafter trace)

Applies the combined 21 findings of the round 1 critic + verifier (critic 7 + verifier 14). For detail, see the §1 traceability matrix of the result file (.omc/reviews/2026-05-27-v0.20-bugfix-plan-drafter-r1c-result.md). This §8 summarizes the substantive changes to the plan body.

Critic r1 (7 finding)

| ID | Severity | Action | Plan section |
|---|---|---|---|
| critic H-1 | HIGH | E3: 5-step procedure that backs up the dogfood config, then cleans and restores (reflects the reality that no external backup file exists) | §3 Step 5 E3 (b) |
| critic M-1 | MEDIUM | line 15 "17 sub-action" → "18 sub-action" | §0 prelude line |
| critic M-2 | MEDIUM | D2: documented the snapshot baseline update mechanic (hand-rolled unwrap_or_else pattern, OQ-2 closure) | §3 Step 4 D2 + §6 OQ-2 |
| critic L-1 | LOW | B1 line range "200-289" → "200-204 (doc) + 205-289 (body)" made explicit | §3 Step 2 B1 |
| critic L-2 | LOW | MockOcrEngine ctor count "9 test (existing)" → "10 instantiation site" (actual probe) | §3 Step 3 C1 + C2 |
| critic L-3 | LOW | D1 pre-action: added a one-line DejaVuSans.ttf existence probe | §3 Step 4 D1 |
| critic NIT-1 | NIT | "5 logical commit" → "4 commit + 1 verify-only step (= 5 step total, 4 commit boundary)" | §0 prelude line |

Verifier r1 (14 finding)

| ID | Severity | Action | Plan section |
|---|---|---|---|
| verifier H-1 | HIGH | B3 sub-action: made the literal updates at pdf_pipeline.rs:168 (hard assertion) + :368 (error message) explicit + tightened the acceptance grep regex (grep -v 'pdf-page-v1\.1') | §3 Step 2 B3 |
| verifier H-2 | HIGH | Step 3 Option A: reflects that common/mod.rs is existing infrastructure. Append 1 line pub mod mock_ocr; + new common/mock_ocr.rs + pdf_ocr_apply.rs lift + new integration test = 4 file edits | §3 Step 3 C1 + C2 + §7 commit 3 file count |
| verifier H-3 | HIGH | D3: corrected the nonexistent file path text_extractor.rs → append to text_extractor_regression.rs (locality with the D2 snapshot regen). include_bytes! path also changed from ../tests/fixtures/... → fixtures/... directly + avoided CWD-relative std::fs::read | §3 Step 4 D3 |
| verifier H-4 | HIGH | D2 snapshot regen mechanic: delete the snapshot file tests/snapshots/vector_pdf_canonical.json + run cargo test twice (1st auto-regen, 2nd verify) + enumerated the 2nd consumer src/text_quality.rs:96 | §3 Step 4 D2 + §7 commit 4 file count |
| verifier M-1 | MEDIUM | Added an expected workspace test count delta table after the §4 verifier checklist (+6 unit + 1 integration, Option A / +5 + 0, Option B) | §4 (sub-section) |
| verifier M-2 | MEDIUM | B2 acceptance phrasing updated: "Step 2 commit time" made explicit + the grep timing documented per sub-action | §3 Step 2 B2 acceptance |
| verifier M-3 | MEDIUM | C2 Option A: spelled out the "mechanical migration of the existing 10 ctor sites" instruction | §3 Step 3 C1 (b) |
| verifier L-1 | LOW | pdf_page_v1.rs line range 200-289 → 205-289 (same edit pass as critic L-1) | §3 Step 2 B1 |
| verifier L-2 | LOW | caller line range 155-185 → 149-186 | §3 Step 2 B2 |
| verifier L-3 | LOW | B5 test scenario comment: expanded the arithmetic for target=1500 byte + overlap=240 byte | §3 Step 2 B5 |
| verifier L-4 | LOW | Confirmed B3 scope safety via grep of ingest_pdf_ocr_smoke.rs (no separate action; the finding itself = no action) | (verified safe) |
| verifier NIT-1 | NIT | E1: tightened the df -h unit handling → force GB units with df -BG --output=avail | §3 Step 5 E1 (a) |
| verifier NIT-2 | NIT | Step 4 commit scope test-fixture → parse-pdf (crate name) | §3 Step 4 commit |
| verifier NIT-3 | NIT | Single definition of the dogfood config canonical path (in §0) + referenced by every acceptance command | §0 pre-flight + §3 Step 5 E3 |

Summary

  • frontmatter status: draft (round 0) → draft (round 1c).
  • frontmatter review_history: added 3 lines for the round 1 critic + verifier + round 1c rewrite entries.
  • plan body line 15: corrected 2 tokens in the prelude statement (sub-action count + commit boundary wording).
  • §0 pre-flight: added 1 bullet on the assumed dogfood KB layout.
  • §3: expanded the detail in the sub-action bodies of the 5 steps (file path / acceptance grep / mechanic / migration cost).
  • §4 verifier checklist: added the expected test count delta sub-section.
  • §6: marked OQ-2 as closed (CLOSED).
  • §7 sequencing summary: updated the file counts (commit 2: 2→4, commit 3: 3-4→4, commit 4: 3-4→5).
  • §8 round 1c rewrite changelog (this section): populated.

Total plan body line change = roughly +250 net added lines (round 0: 698 lines → round 1c: ~950 lines).