3-round dogfood-driven fix cycle 의 산출물: - bugfix1 (Bug #2/#3/#4): spec 964 line + plan 848 line - bugfix2 (Bug #6/#7, #8 falsified): spec 308 line + plan 388 line - bugfix3 (Bug #9/#10/#11/#13/#14, #12 falsified): spec 410 line + plan 1043 line - docs/DOGFOOD.md: 전방위 dogfood checklist 의 전체 (§0 environment ~ §13 reference corpus) 각 round 의 spec/plan 가 critic + verifier round 2 closure ACCEPT 후 frozen. dogfood-driven evidence 기반. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
24 KiB
DOGFOOD — kebab 도그푸딩 시나리오
본 file 은 kebab 의 모든 기능에 대한 dogfood 시나리오 의 single reference. 새 dogfood request 시 본 file 의 시나리오 list 를 참조. 새 sub-item 또는 새 release 마다 §관련 section 의 시나리오 갱신.
사용법: 도그푸딩 요청 시 본 doc 의 시나리오 카탈로그 → 적절한 시나리오 list 선정 → 환경 setup (§0) → 시나리오 실행 → bug log → 발견된 bug 즉시 fix cycle.
목차
- §0 환경 + 사전 준비
- §1 Ingest
- §2 Search
- §3 Ask (RAG)
- §4 Inspect / Fetch / List
- §5 Version cascade
- §6 Wire schema (
--json) - §7 TUI (P9)
- §8 MCP (P9-fb-30)
- §9 Doctor / Schema / Reset
- §10 Eval (P5)
- §11 Edge cases
- §12 Bug discovery checklist
- §13 Reference dogfood corpus
§0 환경 + 사전 준비
0.1 release binary
export CARGO_TARGET_DIR=/build/out/cargo-target/target
export RELEASE_BIN="${CARGO_TARGET_DIR:-target}/release/kebab"
cargo build --release -p kebab-cli -j 4 2>&1 | tail -5
"$RELEASE_BIN" --version # 기대: kebab X.Y.Z (current workspace version)
0.2 Ollama endpoint check
# Remote (사용자 dogfood host)
curl -s --connect-timeout 3 http://192.168.0.47:11434/api/tags | jq -r '.models[]?.name'
# 기대: qwen2.5vl:3b / gemma4:e4b / gemma4:26b / bge-m3:latest / nomic-embed-text:latest
# Local fallback
curl -s --connect-timeout 3 http://localhost:11434/api/tags 2>&1 | head -3
0.3 isolated KB workspace
DOGFOOD=/build/cache/tmp/<sub-item>-dogfood
mkdir -p "$DOGFOOD/kb" "$DOGFOOD/data"
# Default config from `kebab init` then customize
HOME=/tmp/dogfood-home XDG_CONFIG_HOME="$DOGFOOD/xdg" "$RELEASE_BIN" init
cp "$DOGFOOD/xdg/kebab/config.toml" "$DOGFOOD/config.toml"
# 사용자 customize: workspace.root, storage.data_dir, models.{embedding,llm}.endpoint, 기능 enable
0.4 disk + cargo clean policy
memory feedback-cargo-clean-policy:
/buildavail < 500GB OR target > 200GB 시만 cargo clean.- 평소 incremental build.
0.5 dogfood corpus location
본 doc §13 reference. 표준 corpus = PoC 9 PDF + markdown notes + code samples + image set.
§1 Ingest
§1.1 Markdown ingest (P1)
기본:
# Test markdown KB
mkdir -p "$DOGFOOD/kb/notes"
cat > "$DOGFOOD/kb/notes/sample.md" <<'EOF'
# Title
본문 paragraph. **bold** + *italic*.
## Heading 2
- list item 1
- list item 2
```rust
fn main() { println!("code block"); }
EOF
"$RELEASE_BIN" ingest --config "$DOGFOOD/config.toml" --json | tail -1 | jq '.items[] | {doc_path, kind, block_count, chunk_count}'
**verify**:
- `kind` = `"new"` (first ingest) / `"unchanged"` (no change) / `"updated"` (modified).
- `block_count` ≥ heading/paragraph count.
- `chunk_count` ≥ 1.
**scenarios**:
- 1.1.a empty markdown (0 byte) → expected warning + 0 chunks.
- 1.1.b deeply nested heading (h1-h6) → block 모두 capture.
- 1.1.c frontmatter (YAML/TOML) → metadata preserve.
- 1.1.d code fence (Rust/Python/etc.) → code block 으로 chunk.
- 1.1.e mixed inline (link/image/inline-code/strong/em) → inline preserve.
- 1.1.f huge markdown (1 MB+) → walker pass + chunk count delta.
### §1.2 Image ingest (P6)
**Config**:
```toml
[image.ocr]
enabled = true
engine = "ollama-vision"
model = "gemma4:e4b"
endpoint = "http://192.168.0.47:11434"
[image.caption]
enabled = false # opt-in
verify:
*.png/*.jpg/*.jpeg만 ingest target.- OCR text 가
Block::ImageRef.ocr.joined안. [image.caption].enabled=true시 caption 도.
scenarios:
- 1.2.a Korean OCR (한국어 scan PNG) → OCR text + search hit.
- 1.2.b English OCR (typed text screenshot) → alnum > 90%.
- 1.2.c photo (자연 사진, OCR 없음) → empty OCR or warning.
- 1.2.d corrupt image → graceful error.
- 1.2.e oversized image (> max_pixels) → downscale.
§1.3 PDF text ingest (P7-1)
기본:
- vector PDF (embedded text) —
PdfTextExtractor가 page 별 text 추출. - 1
Block::Paragraphper page (P7-1 invariant).
verify:
parser_version = "pdf-text-v1".chunker_version = "pdf-page-v1"(또는"pdf-page-v1.1"from v0.20.1).block_count≥ page count.
scenarios:
- 1.3.a single page vector PDF → 1 block.
- 1.3.b multi-page (10+ pages) → block per page.
- 1.3.c large PDF (50+ pages, 50 MB+) → ingest 시간 + memory monitor.
- 1.3.d encrypted PDF → friendly error wording (
qpdf --decrypthint). - 1.3.e corrupt PDF (truncated bytes) → graceful error.
- 1.3.f PDF with annotations / forms → extract 동작.
§1.4 PDF scanned OCR ingest (v0.20.0 sub-item 1)
Config:
[pdf.ocr]
enabled = true
always_on = false
model = "qwen2.5vl:3b"
endpoint = "http://192.168.0.47:11434"
valid_ratio_threshold = 0.5
min_char_count = 20
verify:
- IngestEvent::PdfOcrStarted / PdfOcrFinished emit.
IngestItem.pdf_ocr_pages > 0for scanned PDF.IngestItem.pdf_ocr_ms_total > 0.- CLI printer:
📷 OCR page N.../✓ OCR page N (chars chars, msms via ollama-vision).
scenarios (이전 dogfood report 의 9 시나리오):
- 1.4.a scanned 한국어 (F1 / F2) → OCR text indexed + search hit.
- 1.4.b multi-scanned PDF (5+ files) → chunk_id collision 0 (Bug #3 fix verify).
- 1.4.c always_on=true → vector PDF page 도 dual-block OCR.
- 1.4.d valid_ratio_threshold variation (0.3 / 0.5 / 0.8) → mojibake / 정상 page 분류.
- 1.4.e min_char_count variation (5 / 20 / 100) → 짧은 page OCR 호출.
- 1.4.f DCTDecode-only skip (F6 FlateDecode / F7 CCITTFax) → warning + skip.
- 1.4.g force-reingest (
--force-reingest) → OCR 재실행. - 1.4.h cancel handle (Ctrl+C 또는 SIGINT) → graceful abort.
- 1.4.i v0.19 → v0.20 upgrade UX (parser_version 보존 + manual force-reingest 필요).
§1.5 Code ingest (P10)
Tier 1 (AST chunkers) — Rust / Python / TS / JS / Go / Java / Kotlin / C / C++
Config:
[ingest.code]
max_file_bytes = 262144
max_file_lines = 5000
ast_chunk_max_lines = 200
verify per lang:
chunker_version=code-{lang}-ast-v1.- symbol path correct (file-scope nesting / module-path / class-method nesting).
extra_skip_globs동작.
scenarios:
- 1.5.a Rust crate (workspace + multiple modules) → mod / fn / impl chunks.
- 1.5.b Python package (src/ layout, dotted module path).
- 1.5.c TypeScript/JavaScript (decorators, generators, classes).
- 1.5.d Go (package + struct methods).
- 1.5.e Java/Kotlin (class + inner class + method).
- 1.5.f C (typedef alias unit, header).
- 1.5.g C++ (namespace::Class::method recursive).
- 1.5.h
.h파일 (C vs C++ syntax) → tree-sitter-c parse 실패 시 Tier 3 fallback.
Tier 2 (resource-aware) — yaml/k8s, dockerfile, manifest (toml/json/xml/groovy)
verify:
- k8s YAML:
apiVersion+kindper resource. - Dockerfile: whole-file
dockerfile-file-v1. - Cargo.toml / package.json / pom.xml: whole-file
manifest-file-v1.
scenarios:
- 1.5.i k8s manifest (multi-resource via
---). - 1.5.j Dockerfile (multi-stage build).
- 1.5.k Cargo.toml workspace (members + dependencies).
- 1.5.l invalid YAML → Tier 3 fallback.
Tier 3 (paragraph fallback) — shell, non-k8s YAML, AST failure
verify:
chunker_version = "code-text-paragraph-v1".- Blank-line paragraph segmentation + 80-line / 20-overlap window for oversize.
scenarios:
- 1.5.m
.sh/.bash/.zsh→ paragraph chunks. - 1.5.n empty file → 0 chunks + warning.
- 1.5.o very long shell script (1000+ lines) → line-window split.
§1.6 Single-file / stdin ingest (p9-fb-31)
# Workspace 외부 file
"$RELEASE_BIN" ingest-file ~/Documents/external.md --config "$DOGFOOD/config.toml"
# 기대: _external/<hash>.md 에 copy + ingest.
# stdin
echo "# stdin content" | "$RELEASE_BIN" ingest-stdin --title "from stdin" --config "$DOGFOOD/config.toml"
scenarios:
- 1.6.a external markdown ingest.
- 1.6.b stdin with
--source-uriflag. - 1.6.c .kebabignore matched file → warn + 진행.
- 1.6.d binary file → reject (markdown only).
§1.7 Incremental ingest
verify:
- 첫 ingest 후 다시 ingest → 모두
unchanged. - 일부 file 수정 후 → 해당 file 만
updated. - 새 file 추가 →
new. - 삭제된 file →
purged_deleted_filescount.
scenarios:
- 1.7.a unchanged path (parser_version + chunker_version + embedding_version match).
- 1.7.b stale file purge.
- 1.7.c
--force-reingest(force-update path).
§1.8 Ingest progress (wire ingest_progress.v1)
--json mode 의 ndjson stream:
scan_started→scan_completed→(asset_started → [pdf_ocr_*]* → asset_finished)+→completed|aborted.
verify:
- ordering invariant (design §2.4a).
- per-asset
idx/total/path/media/result/chunks. - aggregate
countsoncompleted/aborted.
§2 Search
§2.1 Lexical search (P2)
"$RELEASE_BIN" search --config "$DOGFOOD/config.toml" "한국어" --mode lexical --k 10
verify:
- FTS5 trigram (v0.17.0) — 한국어 2자 이상 query hit.
chunks_ftsschema (text,heading_path별 column).
scenarios:
- 2.1.a Korean trigram query (
해시 충돌). - 2.1.b English/Korean mixed (
Rust 충돌). - 2.1.c 1-char query → 0 hit + hint.
- 2.1.d raw mode escape (
heading_path : <token>). - 2.1.e FTS5 phrase query (
"specific phrase"). - 2.1.f exclusion (
-token).
§2.2 Vector search (P3)
"$RELEASE_BIN" search --config "$DOGFOOD/config.toml" "어떤 문장의 의미적 검색" --mode vector --k 10
verify:
multilingual-e5-small(384d) orbge-m3embedding.- LanceDB 의 model 별 separate table.
- similarity score normalized.
scenarios:
- 2.2.a Korean semantic query.
- 2.2.b English semantic query.
- 2.2.c domain-specific (code semantic).
- 2.2.d cross-lingual (한영 mixed query).
§2.3 Hybrid search (RRF, P3)
"$RELEASE_BIN" search --config "$DOGFOOD/config.toml" "query" --mode hybrid --k 10
verify:
fusion_score = [0, 1](normalize).- lexical + vector → RRF fusion.
scenarios:
- 2.3.a hybrid 의 lexical-only / vector-only 가 못 잡는 case (RRF win).
- 2.3.b fusion_score ordering.
§2.4 Search filters (p9-fb-36)
"$RELEASE_BIN" search --config "$DOGFOOD/config.toml" "query" --tag rust --tag api --lang en --path-glob 'src/**'
verify:
--tagrepeatable (OR within).--langISO code.--path-globworkspace_path glob.
§2.5 Search pagination (p9-fb-34)
"$RELEASE_BIN" search "query" --max-tokens 1000 --snippet-chars 200 --json | jq '.next_cursor'
"$RELEASE_BIN" search "query" --cursor "$(...)" --json
verify:
next_cursoropaque token.corpus_revisionmismatch →stale_cursorerror.
§2.6 Search cache (p9-fb-19)
"$RELEASE_BIN" search "query" --json # first call
"$RELEASE_BIN" search "query" --json # cached (in-process LRU, no-op in CLI)
"$RELEASE_BIN" search "query" --no-cache --json # force fresh
§2.7 Bulk search
stdin queries 의 batch:
echo -e "query 1\nquery 2\nquery 3" | "$RELEASE_BIN" search --bulk --json
§3 Ask (RAG)
§3.1 Basic ask (P4)
"$RELEASE_BIN" ask --config "$DOGFOOD/config.toml" "어떤 질문" --json
verify:
groundedfield (boolean).citationslist (chunk references).answertext.
scenarios:
- 3.1.a in-corpus question (grounded=true).
- 3.1.b out-of-corpus question (grounded=false + refusal).
- 3.1.c hallucination check (paraphrase test, fb-41).
§3.2 Streaming ask (v0.17.1)
"$RELEASE_BIN" ask "query" --stream
verify:
- per-event
answer.v1(delta tokens). - final event with
verification+citations.
§3.3 Multi-hop RAG (fb-41 / v0.18.0)
"$RELEASE_BIN" ask "complex question" --multi-hop --json
verify:
- decompose → decide → synthesize loop.
multi_hop_max_depth/multi_hop_max_sub_queries_per_iter따름.
§3.4 NLI verification (fb-41 / v0.18.0)
[models.nli]
model = "Xenova/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
[rag]
nli_threshold = 0.5
verify:
verification.entailment_score.refusal_reason = "nli_verification_failed"(low score).
scenarios:
- 3.4.a known hallucination case (S7 caffeine) → reject with nli_score < threshold.
- 3.4.b legitimate grounded answer → entailment > 0.5.
- 3.4.c NLI model unavailable →
refusal_reason = "nli_model_unavailable".
§3.5 Ask filters
"$RELEASE_BIN" ask "query" --tag rust --path-glob 'src/**' --lang en
§4 Inspect / Fetch / List
§4.1 Inspect
"$RELEASE_BIN" inspect document <doc_id> --json
"$RELEASE_BIN" inspect chunk <chunk_id> --json
verify:
chunk_inspection.v1schema.canonical_document.parser_version/chunker_version.
§4.2 Fetch (p9-fb-35)
verbatim chunk / doc / span fetch:
"$RELEASE_BIN" fetch <chunk_id> --json
"$RELEASE_BIN" fetch <doc_id> --span 12-34 --json
verify:
fetch_result.v1schema.- verbatim text (no chunk wrapping).
§4.3 List
"$RELEASE_BIN" list documents --json
"$RELEASE_BIN" list chunks --json
§5 Version cascade
design §9 cascade rule:
parser_version변경 → 해당 parser 의 모든 chunk 무효.chunker_version변경 → 해당 chunker 의 모든 chunk 무효.embedding_version변경 → 모든 embedding 무효.
verify scenarios:
- 5.a parser_version bump (예:
pdf-text-v1→pdf-text-v2) → 자동 invalidation + 다음 ingest 가 재처리. - 5.b chunker_version bump (예: v0.20.1 의
pdf-page-v1→pdf-page-v1.1) → chunk_id 재계산. - 5.c embedding_version bump (예:
multilingual-e5-small/v1→/v2) → LanceDB 의 별 table. - 5.d 동일 asset 의 doc_id 다른 case →
purge_workspace_path_for_parser_bumpcascade. - 5.e force-reingest 의 user-facing UX.
§6 Wire schema (--json)
§6.1 schemas list
docs/wire-schema/v1/:
ingest_progress.v1,ingest_report.v1search_hit.v1,search_response.v1,bulk_search_item.v1,bulk_search_response.v1answer.v1,answer_event.v1chunk_inspection.v1,citation.v1,doc_summary.v1doctor.v1,schema.v1,fetch_result.v1,reset_report.v1error.v1
§6.2 verify per schema
각 schema:
$RELEASE_BIN <subcommand> --json | jq '.schema_version'
# 기대: "<schema>.v1"
JSON schema validity:
jq -e 'has("schema_version")' <output>
ajv-cli validate -s docs/wire-schema/v1/<schema>.schema.json -d <output>
§6.3 wire backward-compat
verify:
- additive minor (new field / new enum value) → older consumer 가 graceful.
- breaking change (field removal / type change) →
v2major bump.
§6.4 schema_version cascade
schema.v1 (kebab schema --json) output 의 wire_schemas field:
- 16+ entry 의
{name, version, capabilities}.
§7 TUI (P9)
§7.1 Launch TUI
"$RELEASE_BIN" tui --config "$DOGFOOD/config.toml"
§7.2 4 panel verify
- P9-1 Library: workspace document tree + recent assets.
- P9-2 Search: lexical / vector / hybrid search panel.
- P9-3 Ask: question + answer pane + citations.
- P9-4 Inspect: chunk / document detail.
§7.3 keyboard shortcuts
- Tab / Shift+Tab — switch panel.
- Esc — cancel ongoing op.
- q — quit.
- 기타 panel-specific.
§7.4 scenarios
- 7.4.a library tree navigation.
- 7.4.b search query + result selection → fetch.
- 7.4.c ask question + answer + citation click → inspect.
- 7.4.d cancel mid-ingest (Esc).
- 7.4.e quit + restart → state preserve.
§8 MCP (P9-fb-30)
§8.1 Launch MCP stdio server
"$RELEASE_BIN" mcp
# (agent host 가 stdio 로 호출)
§8.2 6 MCP tools
search,ask,schema,doctor,ingest_file,ingest_stdin.
§8.3 verify per tool
- input schema (JSON Schema).
- output schema (wire
*.v1). - error path (graceful, exit code).
§9 Doctor / Schema / Reset
§9.1 Doctor (health check)
"$RELEASE_BIN" doctor --json
verify:
doctor.v1schema.- workspace / storage / models / ollama_reachability.
§9.2 Schema (introspection report)
"$RELEASE_BIN" schema --json
verify:
schema.v1schema.- wire_schemas / capabilities / model_versions / stats.
§9.3 Reset
"$RELEASE_BIN" reset --yes
"$RELEASE_BIN" reset --data-only --yes
"$RELEASE_BIN" reset --vector-only --yes
verify:
- XDG data dirs wipe (irreversible — TTY confirm 또는
--yes). - macOS path collision 회피 (HOTFIXES 2026-05-07).
--data-only가 config 보호.
§10 Eval (P5)
"$RELEASE_BIN" eval --config "$DOGFOOD/config.toml"
verify:
- golden query suite 의 metrics (MRR / Recall / NDCG).
- regression detection (snapshot 비교).
§11 Edge cases
§11.1 Encrypted PDF
- input: thermal-pos-printer.pdf / thermal-label.pdf (사용자 dogfood corpus).
- expected:
kind: "Error"+ friendly wording (qpdf --decrypthint).
§11.2 Corrupt file
- input: truncated bytes / invalid magic.
- expected:
kind: "Error"+ graceful error.
§11.3 Empty file
- input: 0-byte file.
- expected: 0 chunks + warning.
§11.4 Very large file
- markdown 100 MB+, PDF 100 MB+.
- expected: walker pass (per file type limit), parser graceful.
§11.5 env variable overrides
KEBAB_PDF_OCR_ENABLED=true \
KEBAB_PDF_OCR_MODEL=qwen2.5vl:7b \
"$RELEASE_BIN" ingest --config "$DOGFOOD/config.toml"
verify per env:
KEBAB_PDF_OCR_*(11 env, v0.20.0).KEBAB_IMAGE_OCR_*(P6).KEBAB_MODELS_LLM_*,KEBAB_MODELS_EMBEDDING_*.KEBAB_READONLY(write-path subcommand 차단).
§11.6 Cancel handle
- 대형 PDF (metro-korea 58MB) ingest 도중 SIGINT (Ctrl+C).
- expected: graceful abort +
IngestEvent::Aborted+ partial counts.
§11.7 --readonly mode
KEBAB_READONLY=1 "$RELEASE_BIN" ingest # 기대: refuse
KEBAB_READONLY=1 "$RELEASE_BIN" search "query" # 기대: OK
§11.8 --quiet mode
"$RELEASE_BIN" ingest --quiet # stderr 0
"$RELEASE_BIN" ingest --json # implies quiet
§11.9 .kebabignore
# In workspace root
node_modules/
*.tmp
draft/**
verify:
- ignore patterns 정확히 적용.
- per-directory
.kebabignorecascading.
§11.10 Config edge cases
- missing config → fall back to XDG default.
- malformed TOML →
error.v1withconfig_invalid. - unknown field → tolerant (forward-compat) 또는 strict (TBD).
- workspace.root path 변수 (
~,${XDG_DATA_HOME}, relative path) 모두 동작.
§11.11 Concurrent access
- 동일 KB 의 multiple ingest 동시 실행 → SQLite lock 또는 graceful queue.
- ingest 중 search → consistent snapshot.
§11.12 macOS path collision (HOTFIXES 2026-05-07)
config_dir()≠data_dir()보장.- legacy
~/Library/...path 의 자동 migration.
§12 Bug discovery checklist
새 feature 또는 새 release 마다 본 checklist 실행:
§12.1 Pre-flight
- release binary build clean.
- Ollama endpoint reachable + 사용 model pull.
- isolated KB workspace 분리.
- config.toml = default + minimal customize.
§12.2 Ingest path
- 각 media type 의 baseline (markdown / image / pdf / code / Tier2/3).
- empty / corrupt / encrypted file.
- very large file (size limit verify).
- incremental + force-reingest cycle.
- env var override.
- cancel handle mid-ingest.
§12.3 Search path
- lexical (한국어 + 영어 + mixed).
- vector (semantic).
- hybrid (RRF).
- filter (tag/lang/path-glob).
- pagination cursor.
- bulk search.
§12.4 Ask path
- in-corpus question (grounded=true).
- out-of-corpus question (grounded=false).
- streaming (
--stream). - multi-hop (
--multi-hop). - NLI verification (known hallucination case).
§12.5 Surface verify
- CLI flags (각 subcommand 의 --help + actual behavior).
- TUI panel + keyboard shortcuts.
- MCP stdio (each tool).
- Wire schema (
--jsonmode 의 schema_version + jq validity).
§12.6 Version cascade
- parser_version bump → 자동 invalidation.
- chunker_version bump → chunk_id 재계산.
- embedding_version bump → LanceDB 별 table.
§12.7 Edge cases
- env override (각 KEBAB_* env).
- readonly mode.
- quiet mode.
- .kebabignore.
- config edge cases.
- concurrent access.
§12.8 Doc + Wire schema 정합
- README + HANDOFF + ARCHITECTURE 의 사용자 visible surface 일치.
- wire schema 의 actual emit field 와 schema.json 일치.
- error.v1 의
code값이 실제 surface 와 일치.
§12.9 Bug discovery → immediate fix cycle
사용자 명시: "아무리 작은거여도 발견되면 바로 이어서 그걸 수정하는 작업들을 하도록 해".
bug 발견 시:
- immediate stop dogfood (현 scenario 중단).
- bug log —
.omc/reviews/<date>-<sub-item>-dogfood-report.md에 다음 fields:- file:line of root cause.
- reproduction command.
- expected vs actual.
- severity (Critical / Important / Minor).
- spec/plan/executor cycle (size-adapted):
- small bug (1-line fix / wording typo / config field rename): simplified spec (1 page) + plan (1-2 step) + executor.
- large bug (cross-crate / wire schema / new invariant): full spec/plan/executor cycle (이전 v0.20-sub1-bugfix scale).
- fix verify — workspace test + clippy + dogfood scenario 재확인.
- dogfood resume from where it left off.
§13 Reference dogfood corpus
§13.1 PDF (9 file, PoC + sub-item 1)
| # | File | Size | Source | Use case |
|---|---|---|---|---|
| 1 | scanned_page1.pdf (F1) | 466 KB | crates/kebab-parse-pdf/tests/fixtures/ |
scanned 한국어 일반 (OCR alnum > 85%) |
| 2 | scanned_page2.pdf (F2) | 773 KB | crates/kebab-parse-pdf/tests/fixtures/ |
scanned 받침 intensive (OCR alnum > 70%) |
| 3 | metro-korea.pdf | 58 MB | /build/cache/pdf-ocr-poc/fixtures/ |
real-world 신문 multi-page vector PDF |
| 4 | mojibake.pdf (F4) | 23 KB | crates/kebab-parse-pdf/tests/fixtures/ |
vector PDF (Latin + no ToUnicode CMap) |
| 5 | flate_raw.pdf (F6) | 872 B | crates/kebab-parse-pdf/tests/fixtures/ |
FlateDecode skip path |
| 6 | ccitt.pdf (F7) | 2 KB | crates/kebab-parse-pdf/tests/fixtures/ |
CCITTFax skip path |
| 7 | thermal-pos-printer.pdf | 1.1 MB | ~/paperboy/ |
ENG manual PDF (encrypted) |
| 8 | thermal-label.pdf | 2.7 MB | ~/paperboy/ |
ENG manual PDF (encrypted) |
| 9 | internals-presentation.pdf | 820 KB | ~/namu-crawler/docs/ |
slide deck PDF (vector text) |
§13.2 PoC ground-truth (alnum e2e)
/build/cache/pdf-ocr-poc/ground-truth/:
page1.txt(1489 byte) — F1 ground-truth.page2-batchim.txt(3282 byte) — F2 ground-truth (받침 intensive).
§13.3 Markdown / code corpus
(향후 sub-item 별 추가 — ~/Documents/notes/ 또는 별 prepared fixture).
§13.4 Image corpus
(P6 dogfood — 향후 추가).
부록 A — 본 doc 의 갱신 정책
- 새 release / sub-item 머지 시 §1 (Ingest) 또는 §2-§9 (관련 section) 갱신.
- 새 dogfood report (
.omc/reviews/<date>-<sub-item>-dogfood-report.md) 의 scenarios 가 본 doc 의 표준 시나리오 와 align. - 본 doc 의 시나리오 자체에 새 scenarios 추가 시 PR 안 §12 checklist update.
- HOTFIXES.md 의 deviation 발견 시 본 doc 의 관련 section 의 verify step 강화.
부록 B — Reviewer 의 dogfood verify
PR review 시 reviewer 가 본 doc 의 §12 checklist 참조:
- 새 feature 의 변경 surface (CLI flag / wire schema / config field) 가 §1-§9 의 시나리오 cover 하는지.
- new bug discovery 시 §12.9 의 immediate fix cycle 적용.