kebab

Author	SHA1	Message	Date
th-kim0823	bf4ebf8d2a	feat(p10-1a-1): add Metadata.repo / git_branch / git_commit / code_lang Four optional, serde-skipped-when-None fields added to `Metadata` for code ingest context. All 11 downstream construction sites patched with `repo: None, git_branch: None, git_commit: None, code_lang: None`. Full workspace check (`--tests`) and per-crate test suite pass clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 15:44:18 +09:00
altair823	f867b36afb	feat(kebab-core): p9-fb-23 task 2 — CanonicalDocument gains last_chunker_version + last_embedding_version Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:50:25 +00:00
altair823	3a57cab1eb	fix(kebab-store-sqlite): purge stale assets row on workspace_path orphan + smoke P7-3 통합 테스트가 노출한 storage 레이어 버그 fix. `assets.workspace_path` 의 UNIQUE 제약과 `upsert_asset_row` 의 `ON CONFLICT(asset_id)` 만 처리하던 gap 사이 — byte 가 변경된 자산 re-ingest 시 새 asset_id 가 같은 workspace_path 에서 secondary UNIQUE 충돌. md / image / pdf 모두 영향. Fix: - 새 helper `purge_orphan_at_workspace_path` 가 같은 `workspace_path` 의 다른 `asset_id` 를 발견하면 documents → assets 순서로 sweep. documents 의 ON DELETE RESTRICT 회피 + CASCADE 로 blocks / chunks / embedding_records 정리. copied 모드면 storage_path 의 byte 파일도 best-effort 삭제. - `put_asset_with_bytes` 의 두 분기 (copy / reference) + `DocumentStore ::put_asset` 모두 호출. - 회귀 테스트 `put_asset_with_bytes_sweeps_workspace_path_orphan` (이전 의 "UPSERT 실패시 orphan 청소" 테스트가 더 이상 doable 하지 않으므로 대체). - `re_ingest_edited_pdf_produces_new_doc_id` integration `#[ignore]` 해제 → 9 통합 테스트 모두 default 로 통과. Vector store orphan 은 별도 P+ task — LanceDB 가 SQLite cascade 와 무관하게 운영되므로 stale chunk_id vector 가 디스크에 남음. 검색에는 영향 없음 (search 가 SQLite join 통해 surface). Smoke 검증 (release binary, markdown 2 + image 1 + PDF 2): - doctor pass - 첫 ingest: 5 new - list docs: 5 docs all media types - search lexical "pdf-page-v1 chunker" → whitepaper.pdf hit - search hybrid → cross-media 결과 - inspect doc PDF: parser_version=pdf-text-v1, blocks 가 SourceSpan::Page - 동일 byte re-ingest: 5 updated, 0 errors (P1 idempotency) - byte 수정 후 re-ingest: 1 new (해당 PDF) + 4 updated, 0 errors (storage fix) - corrupt PDF 추가: errors+=1 + IngestItem.error 메시지 정확, 다른 자산 영향 0 - 정리 후 다시 ingest: errors=0 - RAG ask: PDF 인용 + `citations[].citation` 에 `kind: "page"` + `page: <N>` + `path: <pdf_path>` 정확히 노출 운영 fixture 보조: - `crates/kebab-parse-pdf/examples/gen_smoke_pdf.rs` — `cargo run --release --example gen_smoke_pdf -p kebab-parse-pdf -- <out.pdf> <text-pages>` 로 reportlab/qpdf 없이 in-tree PDF 생성. - `crates/kebab-parse-image/examples/gen_smoke_png.rs` — 동일 방식의 PNG fixture 생성. - SMOKE.md 가 두 example 사용법 + 갱신된 HOTFIXES 동작 (byte 수정 시 errors+=1 → new+=1) 반영. HOTFIXES `2026-05-02 P7-3` entry 가 \"deferred\" → \"fixed in same PR\" 로 업데이트, vector store orphan caveat 만 남음. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:41:23 +00:00
altair823	8de08cf38c	review(p7-1): 회차 1 지적 반영 - Cargo.toml: 사용하지 않는 deps 제거 (`kebab-config`, `thiserror`, `pdf-extract`, dev `tempfile` / `serde_json` / `serde`). 특히 `pdf-extract` 가 끌어오던 transitive ~150 crate (pom, postscript, type1-encoding-parser, adobe-cmap-parser, euclid, chrono, md5, linked-hash-map …) 가 모두 사라짐. lopdf 만 남음. - info.rs: BOM 없는 PDFDocEncoded Title 디코드 버그 수정. `from_utf8_lossy` 는 0x80–0xFF 를 U+FFFD 로 치환해 "Café" 같은 레거시 타이틀을 망가뜨림. byte → `char` 직접 캐스팅 (Latin-1 디코더) 로 교체. 회귀 테스트 `info_dict_title_pdfdocencoding_latin1_high_bytes_decoded` 추가. - info.rs: 모듈 doc 의 "Latin-1 superset" 부정확 표현 정정 — PDFDocEncoding 은 0x18–0x1F / 0x80–0x9F 영역에서 Latin-1 과 다름. - lib.rs: `saturating_sub(1)` 가 page=0 케이스를 silent 흡수하던 부분에 `debug_assert!` 추가. release 는 saturating fallback 유지 (panic 보다 garbled order 가 운영에 유리). - tests: UTF-16 surrogate pair 커버리지 갭 보완 — 🥙 (U+1F959) 가 포함된 타이틀로 `String::from_utf16_lossy` 의 페어-결합 경로 검증. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 08:40:40 +00:00
altair823	5a158d7343	feat(kebab-parse-pdf): P7-1 text PDF extractor — per-page CanonicalDocument `PdfTextExtractor`(MediaType::Pdf) lopdf 기반 per-page 텍스트 추출. 페이지마다 `Block::Paragraph` + `SourceSpan::Page { page, char_start, char_end }` emit. 본문이 비거나 추출 panic 인 페이지는 빈 paragraph + `Provenance::Warning` ("scanned candidate") 로 표시 — 이후 OCR fallback (별도 task) 의 입력. 핵심 동작: - `lopdf::Document::load_mem` + `is_encrypted()` → 암호화 PDF 는 명시 에러 (`qpdf --decrypt` 안내). - 페이지 단위 `extract_text(&[page])` 를 `catch_unwind` 로 감싸 malformed page panic 을 recoverable warning 으로 변환. - `/Info` dict 에서 Title/Producer/Creator best-effort 추출. UTF-16BE BOM prefixed 문자열도 디코드 (한국어 등 non-ASCII Title 정상 처리). - 9개 통합 테스트: 3-page emit, scanned-mixed warning, encrypted refuse, corrupt header error, page_count 메타, UTF-16BE Title, filename fallback, determinism, snapshot. `parser_version = "pdf-text-v1"`. Allowed deps: `lopdf 0.32` + `pdf-extract 0.7` (원본 spec 그대로). 본문 다국어 OCR fallback 은 §9.2 후속 task (out of scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 08:34:55 +00:00

5 Commits