feat(kebab-parse-pdf): P7-1 text PDF extractor #37

altair823 · 2026-05-02T08:35:48Z

altair823 commented

2026-05-02 08:35:48 +00:00

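## Summary

P7-1 adds the new `kebab-parse-pdf` crate: an `Extractor` implementation that reads `MediaType::Pdf` assets page by page and emits each page as a `Block::Paragraph` with a `SourceSpan::Page`. A page whose body is empty, or whose extraction panicked, is emitted as an empty paragraph plus a `Provenance::Warning` ("scanned candidate"), which the later OCR fallback task will use as its input signal.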
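
The per-page outcome handling described in the summary can be sketched roughly as follows. This is a minimal illustration, not the crate's code: `PageParagraph`, `Warning`, and `classify_page` are hypothetical stand-ins for the real `Block::Paragraph` / `Provenance::Warning` types, and treating whitespace-only text as empty is an assumption.

```rust
/// Hypothetical stand-in for a per-page paragraph block.
#[derive(Debug, PartialEq)]
pub struct PageParagraph {
    pub page: u32,
    pub text: String,
}

/// Hypothetical stand-in for a provenance warning attached to a page.
#[derive(Debug, PartialEq)]
pub struct Warning {
    pub page: u32,
    pub note: String,
}

/// Fold one page's extraction result into a paragraph plus an optional warning.
/// A failed extraction and an empty (here: whitespace-only, an assumption) text
/// both yield an empty paragraph flagged as a "scanned candidate".
pub fn classify_page(
    page: u32,
    extracted: Result<String, String>,
) -> (PageParagraph, Option<Warning>) {
    match extracted {
        Ok(text) if !text.trim().is_empty() => (PageParagraph { page, text }, None),
        Ok(_) => (
            PageParagraph { page, text: String::new() },
            Some(Warning {
                page,
                note: "scanned candidate: no extractable text".to_string(),
            }),
        ),
        Err(reason) => (
            PageParagraph { page, text: String::new() },
            Some(Warning {
                page,
                note: format!("scanned candidate: {reason}"),
            }),
        ),
    }
}
```

Keeping the empty paragraph instead of dropping the page means every page still has a block, so a later OCR pass has a slot to fill in.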

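## Key decisions

- **per-page API**: `pdf-extract 0.7` exposes only a whole-document API, so extraction uses `lopdf::Document::extract_text(&[page])`. lopdf's known panic on malformed pages is absorbed with `std::panic::catch_unwind` and converted into a recoverable warning.
- **encrypted PDF**: fail immediately after the `Document::is_encrypted()` check, with an error that points to `qpdf --decrypt`. v1 makes no decryption attempt.
- **/Info Title UTF-16BE BOM**: per the PDF spec, a Title that cannot be represented in PDFDocEncoding (Korean, for example) is encoded as UTF-16BE behind a `0xFE 0xFF` BOM. `info::pdf_string` detects the BOM and decodes with `from_utf16_lossy`.
- **in-memory test fixtures**: the `lopdf` builder helper (`tests/common/mod.rs:build_text_pdf`) synthesizes the PDF bytes for each test. No binary fixtures are committed (continuing the P6 convention).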
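
The BOM-based Title decoding named in the list above might look like the sketch below. It uses only the standard library; `decode_utf16be_bom` is a hypothetical name, and the real `info::pdf_string` may be structured differently.

```rust
/// Decode a PDF text string that starts with the UTF-16BE byte order mark.
/// Returns `None` when the BOM is absent so the caller can try another decoding.
pub fn decode_utf16be_bom(bytes: &[u8]) -> Option<String> {
    if !bytes.starts_with(&[0xFE, 0xFF]) {
        return None;
    }
    // Pair up the payload bytes big-endian; a dangling odd byte is dropped.
    let units: Vec<u16> = bytes[2..]
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    // Lossy decoding replaces unpaired surrogates with U+FFFD instead of failing.
    Some(String::from_utf16_lossy(&units))
}
```

Returning `None` for BOM-less input keeps the fallback decision (PDFDocEncoding, filename, and so on) with the caller.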

## Tests (9)

- 3-page PDF → verifies per-page `SourceSpan::Page` and `char_start`/`char_end`
- scanned-mixed (page 2 has no contents) → empty paragraph + exactly one Warning
- encrypted PDF → error + remediation message
- corrupt header → error
- `page_count` metadata is correct
- UTF-16BE Title decode (`"케밥 문서"`)
- filename fallback when `/Info` is absent
- determinism (same bytes → same JSON)
- snapshot: core shape is stable

## Spec mapping

- Source spec: `tasks/p7/p7-1-pdf-text-extractor.md`
- Design §3.4 (`SourceSpan::Page`, `Block::Paragraph`), §9.2 (PDF text extraction), §9 versioning
- `parser_version = "pdf-text-v1"`

## Verification

- `cargo test -p kebab-parse-pdf`: 9 passed
- `cargo clippy -p kebab-parse-pdf --all-targets -- -D warnings`: clean
- `cargo check --workspace`: clean

## Out of scope (per spec)

- OCR fallback for scanned PDFs (separate follow-up task; planned to reuse the P6-2 OCR adapter)
- multi-column reading order / tables / equations / form fields / bookmarks
- multilingual body text (depends on CID fonts); v1 covers only the Title metadata path

## Test plan

- [x] all 9 tests pass
- [x] clippy `-D warnings` passes
- [x] workspace check passes
- [ ] after manual review by the user, try ingesting a real PDF (possible once P7-2 and the app wiring are merged)

altair823 added 1 commit 2026-05-02 08:35:49 +00:00

feat(kebab-parse-pdf): P7-1 text PDF extractor — per-page CanonicalDocument 5a158d7343

`PdfTextExtractor` (`MediaType::Pdf`): per-page text extraction built on lopdf.
Emits one `Block::Paragraph` + `SourceSpan::Page { page, char_start, char_end }`
per page. A page with an empty body or an extraction panic becomes an empty
paragraph + `Provenance::Warning` ("scanned candidate"), which is the input for
the later OCR fallback (separate task).

Core behavior:
- `lopdf::Document::load_mem` + `is_encrypted()` → an encrypted PDF is an
  explicit error (points to `qpdf --decrypt`).
- Per-page `extract_text(&[page])` is wrapped in `catch_unwind`, turning a
  malformed-page panic into a recoverable warning.
- Best-effort extraction of Title/Producer/Creator from the `/Info` dict.
  UTF-16BE BOM-prefixed strings are decoded too (non-ASCII Titles such as
  Korean come through intact).
- 9 integration tests: 3-page emit, scanned-mixed warning, encrypted refuse,
  corrupt header error, page_count metadata, UTF-16BE Title, filename
  fallback, determinism, snapshot.

`parser_version = "pdf-text-v1"`. Allowed deps: `lopdf 0.32` + `pdf-extract 0.7`
(as in the source spec). Multilingual body text / OCR fallback is a §9.2
follow-up task (out of scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude-reviewer-01 requested changes 2026-05-02 08:38:56 +00:00

claude-reviewer-01 left a comment

Round 1: mainly dep cleanup, Latin-1 decoding correctness, and test-coverage gaps. The code path itself is solid (catch_unwind guard, in-memory fixtures, deterministic provenance shape). Please go through the 7 inline comments before merging.

crates/kebab-parse-pdf/Cargo.toml

						
				@@ -0,0 +8,4 @@

				description   = "Text PDF extractor (per-page text + page citation) for the kebab pipeline (P7-1)"

				[dependencies]

				kebab-core   = { path = "../kebab-core" }

claude-reviewer-01 commented

2026-05-02 08:38:56 +00:00

(issue / dep cleanup) The following deps are not imported anywhere in the library or the tests: after `cargo check -p kebab-parse-pdf`, their names do not appear in the `target/.../kebab_parse_pdf-*` artifacts, and `grep -rn '<name>' crates/kebab-parse-pdf/{src,tests}` returns 0 hits:

- `kebab-config` (line 12): a runtime dep that the lib never imports. The P7-1 spec § Allowed deps does list `kebab-config`, but the v1 implementation does not need it, so the honest move is to drop it for now and re-add it when needed.
- `thiserror` (line 18): dead, since only anyhow is used.
- `pdf-extract` (line 25): the heaviest dead dep. As the Cargo.lock part of this PR's diff shows, `pdf-extract` newly pulls in at least 8 transitive crates, including `pom`, `postscript`, `type1-encoding-parser`, `adobe-cmap-parser`, `euclid`, `chrono`, `md5`, and `linked-hash-map`. This task works with lopdf alone (the per-page API lives in lopdf, and the result of `lopdf::Document::load_mem` is enough as a sanity check), so `pdf-extract` should be added only when a follow-up task (scanned-PDF OCR fallback) actually needs it.
- `tempfile` (line 28, dev-dep): dead, since the tests never touch disk and run entirely on in-memory bytes.
- `serde_json` (line 30, dev-dep): a duplicate, since the regular dep on line 14 is already available to the tests.
- `serde` (line 14): `lib.rs` imports only `serde_json` and never `serde` itself. The derive macros are used by kebab-core on its own. Safe to remove.

Why: these 6 dead deps bring in ~150 crates transitively, inflating cold build time and the target/ disk footprint (the workspace is already at 6–10 GB). Combined with the history noted in CLAUDE.md of the `-j 1` full-suite build hitting the memory limit, this is not a negligible cost.

How to apply: remove the 6 lines above, then re-run `cargo check -p kebab-parse-pdf && cargo test -p kebab-parse-pdf`. The spec's Allowed deps section declares an upper bound on what the crate may depend on, so dropping a dep that is not actually used does not violate the spec.

crates/kebab-parse-pdf/src/info.rs Outdated

						
				@@ -0,0 +2,4 @@

				//!

				//! PDFs may carry a `/Info` trailer dictionary with `Title`,

				//! `Producer`, `Creator`, etc. Strings are encoded as either

				//! PDFDocEncoding (Latin-1 superset) OR UTF-16BE prefixed with the

claude-reviewer-01 commented

2026-05-02 08:38:56 +00:00

(nit) "PDFDocEncoding (Latin-1 superset)" in the doc comment is inaccurate. PDFDocEncoding is not a superset of Latin-1: it maps 0x18–0x1F and 0x80–0x9F differently (e.g. 0x18 = breve, 0x80 = bullet, 0x83 = ellipsis). Something like "agrees with Latin-1 in the 0x20–0x7E and 0xA1–0xFF ranges" is accurate.

How to apply: fix this in one pass together with the dep cleanup and the other issue raised on this file. It is a one-line wording correction, so handling it within this PR's scope is fine.
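
To make the divergence from Latin-1 concrete, here is a partial lookup sketch. The code points are recalled from the PDFDocEncoding table in ISO 32000-1 Annex D and should be checked against the spec before being relied on. `pdfdoc_byte_to_char` is a hypothetical helper that does not exist in the PR, and it deliberately covers only a handful of bytes.

```rust
/// Map a PDFDocEncoding byte to a `char` for a few of the bytes that differ
/// from Latin-1. Partial sketch: only 0x18..=0x1F, 0x80..=0x85 and 0xA0 are
/// special-cased; every other byte falls back to the Latin-1 identity cast,
/// which is still wrong for the rest of 0x86..=0x9F.
pub fn pdfdoc_byte_to_char(b: u8) -> char {
    match b {
        0x18 => '\u{02D8}', // breve
        0x19 => '\u{02C7}', // caron
        0x1A => '\u{02C6}', // modifier letter circumflex
        0x1B => '\u{02D9}', // dot above
        0x1C => '\u{02DD}', // double acute accent
        0x1D => '\u{02DB}', // ogonek
        0x1E => '\u{02DA}', // ring above
        0x1F => '\u{02DC}', // small tilde
        0x80 => '\u{2022}', // bullet
        0x81 => '\u{2020}', // dagger
        0x82 => '\u{2021}', // double dagger
        0x83 => '\u{2026}', // horizontal ellipsis
        0x84 => '\u{2014}', // em dash
        0x85 => '\u{2013}', // en dash
        0xA0 => '\u{20AC}', // euro sign
        other => other as char, // Latin-1 identity for everything else
    }
}
```

A full decoder would need the complete 0x80–0x9F table; the point here is only that an identity cast is exact for ASCII and 0xA1–0xFF and approximate elsewhere.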

crates/kebab-parse-pdf/src/info.rs

						
				@@ -0,0 +65,4 @@

				    // about, and Latin-1 is byte-identical to UTF-8 only for ASCII;

				    // `from_utf8_lossy` is the conservative call here. ASCII-only

				    // PDFs (the common case) round-trip cleanly.

				    let s = String::from_utf8_lossy(bytes).into_owned();

claude-reviewer-01 commented

2026-05-02 08:38:56 +00:00

(issue) `from_utf8_lossy` is not a Latin-1 decoder. Bytes in the 0x80–0xFF range of PDFDocEncoding (accented and extended Latin characters) are mostly invalid as standalone UTF-8 bytes, so they get replaced with `\u{FFFD}` (the replacement character). A non-ASCII Title encoded in PDFDocEncoding without a BOM (e.g. an ISO-8859-1 style "Café") therefore comes out broken, as `"Caf\u{FFFD}"`.

Why: most real-world PDFs use the UTF-16BE BOM path (handled by the earlier branch), but some legacy PDFs (LaTeX output from the 1990s–2000s in particular) write the Title in BOM-less PDFDocEncoding. In that case a wrong Title is stored in the metadata and exposed to downstream search/RAG. It does not fall through to the filename fallback either, because the `.is_empty()` check passes.

How to apply: PDFDocEncoding is identical to ASCII in 0x20–0x7E, and in 0xA1–0xFF it matches the Latin-1 / Unicode code points. A direct byte → char cast keeps the ASCII case intact and rescues the Latin-1 case:

```rust
let s: String = bytes.iter().map(|&b| b as char).collect();
if s.is_empty() { None } else { Some(s) }
```

The PDFDocEncoding-specific bytes in 0x18–0x1F (8 bytes: breve, caron, etc.) and 0x80–0xA0 are still mapped inaccurately, but that is an acceptable limit of a best-effort decoder that is not a full PDFDocEncoding implementation (the current fallback does not map those ranges correctly either, so there is no regression). The common case, accented Latin characters, decodes correctly.
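
The difference between the two decoders in this comment can be checked with a small standalone sketch. `decode_latin1` is a hypothetical wrapper around the suggested cast, not the function in `info.rs`.

```rust
/// Decode bytes as Latin-1: each byte becomes the Unicode scalar value with
/// the same number, so no byte is ever rejected or replaced.
pub fn decode_latin1(bytes: &[u8]) -> Option<String> {
    let s: String = bytes.iter().map(|&b| b as char).collect();
    if s.is_empty() { None } else { Some(s) }
}

/// The previous behaviour, kept for comparison: bytes that are not valid
/// UTF-8 are replaced with U+FFFD.
pub fn decode_utf8_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For the bytes `[0x43, 0x61, 0x66, 0xE9]` the cast keeps the final `é`, while the lossy UTF-8 path replaces the lone `0xE9` with U+FFFD.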

crates/kebab-parse-pdf/src/lib.rs Outdated

						
				@@ -0,0 +133,4 @@

				            };

				            // ordinal = page - 1; saturating_sub guards the (shouldn't-happen)

				            // case where lopdf hands back a 0-indexed page key.

				            let ordinal = page_num.saturating_sub(1);

claude-reviewer-01 commented

2026-05-02 08:38:56 +00:00

(issue) `saturating_sub(1)` silently absorbs the "shouldn't-happen" page=0 case, and on that path page=0 and page=1 both map to ordinal=0. The `id_for_block` results do not collide: heading_path is empty for both and both spans are `SourceSpan::Page`, but the page fields differ, so the spans differ. The ordinal, however, no longer reflects the actual page number, and if the chunk stage sorts by ordinal, the relative order of page=1 (ordinal 0) and page=0 (ordinal 0) is no longer well defined (it depends on how the sort breaks ties).

Why: lopdf 0.32 guarantees 1-based page numbers, and the spec does not explicitly require a guard against page=0 leaking in from a future version or a corrupt PDF. Still, an explicit signal is friendlier to debugging than silent absorption.

How to apply (pick one):

Option A: state the invariant with a debug_assert only:

```rust
debug_assert!(page_num >= 1, "lopdf get_pages() returned 0-based page key");
let ordinal = page_num.saturating_sub(1);
```

Option B: guard in release as well, downgrading page 0 to a Warning:

```rust
if page_num == 0 {
    events.push(ProvenanceEvent { ..., kind: Warning, note: Some("lopdf yielded page=0 (skipped)".into()) });
    continue;
}
let ordinal = page_num - 1;
```

Option A fits this PR's scope better (it adds a single assert line and leaves the existing saturating code in place).
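
As a standalone comparison of the two options, the sketch below isolates the ordinal computation. `ordinal_lenient` and `ordinal_strict` are hypothetical names; the strict variant returns `Option` instead of pushing a provenance event, which is a simplification of Option B.

```rust
/// Option A: assert the 1-based invariant in debug builds; in release builds
/// the assert compiles away and page 0 is clamped to ordinal 0.
pub fn ordinal_lenient(page_num: u32) -> u32 {
    debug_assert!(page_num >= 1, "page numbers are expected to be 1-based");
    page_num.saturating_sub(1)
}

/// Option B (simplified): reject page 0 in every build profile, leaving it to
/// the caller to record a warning and skip the page.
pub fn ordinal_strict(page_num: u32) -> Option<u32> {
    page_num.checked_sub(1)
}
```

With the strict form a page 0 can never share ordinal 0 with page 1, at the cost of one more branch at the call site.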

crates/kebab-parse-pdf/src/page_text.rs

						
				@@ -0,0 +8,4 @@

				pub(crate) fn extract_one(doc: &lopdf::Document, page: u32) -> anyhow::Result<String> {

				    let result = catch_unwind(AssertUnwindSafe(|| doc.extract_text(&[page])))

				        .map_err(|_| anyhow::anyhow!("panic during lopdf::Document::extract_text"))?;

claude-reviewer-01 commented

2026-05-02 08:38:56 +00:00

(praise) Using `catch_unwind` + `AssertUnwindSafe` to turn lopdf's historical malformed-page panic into a recoverable `Err` is a good call. It expresses the spec § Behavior contract requirement, "wrap with `catch_unwind` to absorb the rare crash on malformed pages", in exactly one line, and the `match` arms at the call site (lib.rs:115) funnel panic / Err / Ok-empty into the same "scanned candidate" channel, which is consistent from an operator's point of view. One caveat applies only if the release profile later moves to `panic=abort`: catch_unwind does not work there, so adding a one-line "assumes panic=unwind" note to this function's header doc would be kind to future readers.
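
The pattern praised here, together with the `panic=abort` caveat, can be sketched without lopdf. In this illustration the closure stands in for `doc.extract_text(&[page])`, and `extract_guarded` and `guard_is_effective` are hypothetical names.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Run a fallible extraction step and turn a panic into an ordinary error.
/// `extract` is a stand-in for the real per-page call.
pub fn extract_guarded<F>(extract: F) -> Result<String, String>
where
    F: FnOnce() -> Result<String, String>,
{
    match catch_unwind(AssertUnwindSafe(extract)) {
        // The closure ran to completion; pass its own Ok/Err through.
        Ok(result) => result,
        // The closure panicked; report it as a recoverable error instead.
        Err(_) => Err("panic during page text extraction".to_string()),
    }
}

/// True when this build unwinds on panic. Under `panic = "abort"` the process
/// terminates before `catch_unwind` can return, so the guard has no effect.
pub fn guard_is_effective() -> bool {
    cfg!(panic = "unwind")
}
```

The default panic hook still prints the panic message to stderr; only the unwinding stops at the guard.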

crates/kebab-parse-pdf/tests/common/mod.rs

						
				@@ -0,0 +39,4 @@

				    let font_id = doc.add_object(dictionary! {

				        "Type" => "Font",

				        "Subtype" => "Type1",

				        "BaseFont" => "Helvetica",

claude-reviewer-01 commented

2026-05-02 08:38:56 +00:00

(praise) Synthesizing in-memory PDFs with the `lopdf` builder carries on the "no committed binary fixtures" convention established in P6's `kebab-parse-image::tests::common`. The `pages: &[Option<&str>]` signature expresses both cases, a page with content and an empty page (no Contents), in a single slice, which makes building the spec's "scanned-mixed PDF" fixture clean: the call site (`tests/extractor.rs:empty_page_emits_warning_and_empty_paragraph`) states its intent in one line.

crates/kebab-parse-pdf/tests/extractor.rs Outdated

claude-reviewer-01 commented

2026-05-02 08:38:56 +00:00

(suggestion / coverage gap) The current UTF-16BE decode test uses "케밥 문서", which lies entirely in the BMP (one u16 unit per character). Cases that go through a surrogate pair (e.g. emoji, supplementary-plane CJK) are not covered, so the surrogate handling path of `String::from_utf16_lossy` is never exercised.

Why: if the BOM branch of `info::pdf_string` breaks surrogate pairs (say it gets `chunks_exact(2)` right but combines the pair incorrectly), the UTF-16 BOM decoding itself is corrupted. A BMP-only test cannot catch that.

How to apply: one extra line is enough. For example:

```rust
let info = InfoDict {
    title: Some(utf16be_bom("케밥 🥙 문서")),  // 🥙 = U+1F959, surrogate pair
    ...
};
```

Then `assert_eq!(doc.title, "케밥 🥙 문서");`. Either extend the existing test or split it out as a separate `info_dict_title_utf16be_surrogate_pair` test.
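
Why the emoji exercises a different path than the Hangul text can be seen in a standalone round trip. The `utf16be_bom` helper below is an assumed implementation of the helper named in the snippet above (its real body is not shown in the thread), and `utf16_unit_count` and `decode_after_bom` are hypothetical.

```rust
/// Encode `s` as a BOM-prefixed UTF-16BE byte string (assumed test helper).
pub fn utf16be_bom(s: &str) -> Vec<u8> {
    let mut out = vec![0xFE, 0xFF];
    for unit in s.encode_utf16() {
        out.extend_from_slice(&unit.to_be_bytes());
    }
    out
}

/// Number of UTF-16 code units a string needs; this exceeds the char count
/// whenever a character lies outside the BMP and needs a surrogate pair.
pub fn utf16_unit_count(s: &str) -> usize {
    s.encode_utf16().count()
}

/// Decode everything after the two BOM bytes as UTF-16BE.
pub fn decode_after_bom(bytes: &[u8]) -> String {
    let payload = bytes.get(2..).unwrap_or_default();
    let units: Vec<u16> = payload
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16_lossy(&units)
}
```

"케밥 🥙 문서" has 7 characters but 8 UTF-16 units, because U+1F959 is encoded as the surrogate pair `0xD83E 0xDD59`; a decoder that mishandles pairs would corrupt exactly that character.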

altair823 added 1 commit 2026-05-02 08:40:43 +00:00

review(p7-1): 회차 1 지적 반영 8de08cf38c

- Cargo.toml: remove unused deps (`kebab-config`, `thiserror`,
  `pdf-extract`, `serde`, and dev-deps `tempfile` / `serde_json`). In
  particular, the ~150 transitive crates that `pdf-extract` was pulling in
  (pom, postscript, type1-encoding-parser, adobe-cmap-parser, euclid, chrono,
  md5, linked-hash-map …) are all gone. lopdf is the only PDF dependency left.
- info.rs: fix the decoding bug for BOM-less PDFDocEncoded Titles.
  `from_utf8_lossy` replaces invalid 0x80–0xFF bytes with U+FFFD and mangles
  legacy titles such as "Café". Replaced with a direct byte → `char` cast
  (a Latin-1 decoder). Added regression test
  `info_dict_title_pdfdocencoding_latin1_high_bytes_decoded`.
- info.rs: correct the inaccurate "Latin-1 superset" wording in the module
  doc: PDFDocEncoding differs from Latin-1 in the 0x18–0x1F / 0x80–0x9F ranges.
- lib.rs: add a `debug_assert!` where `saturating_sub(1)` silently absorbed
  the page=0 case. Release keeps the saturating fallback (a garbled order is
  better for operations than a panic).
- tests: close the UTF-16 surrogate pair coverage gap: a title containing
  🥙 (U+1F959) verifies the pair-combining path of `String::from_utf16_lossy`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude-reviewer-01 approved these changes 2026-05-02 08:42:10 +00:00

claude-reviewer-01 left a comment

Round 2: all 4 findings from round 1 (cleanup of 6 deps, Latin-1 decode bug, module doc accuracy, saturating_sub assert) are addressed, and 2 regression tests were added (UTF-16 surrogate pair / Latin-1 high-byte). clippy clean, 11/11 tests pass. Ready to merge.

crates/kebab-parse-pdf/Cargo.toml

						
				@@ -0,0 +21,4 @@

				# at v1 (we don't call its whole-doc API), and the future scanned-PDF

				# OCR fallback can re-add it when it actually needs it.

				lopdf        = "0.32"

claude-reviewer-01 commented

2026-05-02 08:42:10 +00:00

(praise) Heading off the unused `pdf-extract` dep with a comment is a good move. The two lines, "we don't actually call it" and "re-add it when the OCR fallback task needs it", will give pause to anyone who later tries to put `pdf-extract` back reflexively. Given the cost of dep bloat in this workspace (a 6–10 GB workspace target/, and the link-time memory limit across 18 integration test binaries), a comment explaining why a dep is absent is worth more than the bare removal.

crates/kebab-parse-pdf/src/info.rs

claude-reviewer-01 commented

2026-05-02 08:42:10 +00:00

(praise) `bytes.iter().map(|&b| b as char).collect()` neatly exploits, in one line, the fact that Rust's byte → char cast lands on the matching Latin-1 code point. With no extra dep (encoding_rs or the like) it decodes PDFDocEncoded titles in the 0xA1–0xFF range correctly while keeping the ASCII round trip intact. The regression test (`info_dict_title_pdfdocencoding_latin1_high_bytes_decoded`) covers exactly this branch, so it goes red immediately if someone reverts to `from_utf8_lossy`.

crates/kebab-parse-pdf/src/lib.rs

						
				@@ -0,0 +135,4 @@

				            // collapse two pages onto the same ordinal (silently breaking

				            // ordinal-based sorting downstream), so we assert the

				            // invariant in dev builds. The release fallback still uses

				            // saturating_sub so a future lopdf regression degrades to

claude-reviewer-01 commented

2026-05-02 08:42:10 +00:00

(praise) The combination of `debug_assert!` and a release fallback gets both debuggability and operational robustness. In dev/test, a contract violation by lopdf is caught at once as a panic; in release, saturating_sub degrades to a garbled order. The judgment that one out-of-order page in production beats a panic halting the whole ingest matches the tone of P6-4's lenient policy (extract succeeds → store; OCR/caption fails → Warning).

crates/kebab-parse-pdf/tests/extractor.rs

						
				@@ -0,0 +197,4 @@

				}

				#[test]

				fn info_dict_title_pdfdocencoding_latin1_high_bytes_decoded() {

claude-reviewer-01 commented

2026-05-02 08:42:10 +00:00

(praise) 🥙 (U+1F959, supplementary plane) forces execution through the surrogate-pair combining branch of `String::from_utf16_lossy`. With a one-line input it blocks a future regression that a BMP-only test would never catch (e.g. a faulty refactor that looks only at `chunks_exact(2)` units and splits the surrogate pair). The emoji is well chosen too: it lines up with the identity of the Korean knowledge base (kebab), so a reader sees the intent immediately.

altair823 merged commit d64320b508 into main

2026-05-02 08:44:51 +00:00

altair823 deleted branch feat/p7-1-pdf-text-extractor

2026-05-02 08:44:52 +00:00

altair823 referenced this issue from a commit

2026-05-02 08:44:52 +00:00

Merge pull request 'feat(kebab-parse-pdf): P7-1 text PDF extractor' (#37) from feat/p7-1-pdf-text-extractor into main

Reference: altair823-org/kebab#37