마지막 commit. 모든 .md 안의 `kb` 단어 일괄 갱신. - 19 개 crate 이름 (`kb-core`, `kb-app`, …) → `kebab-*` (Rust 모듈 path 표기 `kb_*` → `kebab_*` 포함). - 미래 component (`kb-tui`, `kb-desktop`, `kb-asr-whisper`, `kb-ocr`, `kb-mcp`, `kb-vlm`, `kb-rerank`, `kb-vision-ocr`, `kb-index`, `kb-smoke`, `kb-architecture`) → `kebab-*` (P6+ 가 시작될 때 같은 prefix 사용). - CLI 명령 예제: `kb ingest` / `kb search` / `kb ask` / `kb init` / `kb doctor` / `kb inspect` / `kb list` / `kb eval` → `kebab <verb>`. fenced code block + 인라인 backtick 모두. - XDG paths + env vars + binary 경로 (`target/release/kb` → `target/release/kebab`) 동기화. - design doc / 최초 보고서 / SMOKE / HOTFIXES / phase epic / task spec 모든 reference 통일. - task-decomposition.md 의 `git -c user.name=kb` 는 과거 git history 기록용 author 정보라 그대로 유지 (실제 git history 의 author 는 변경 불가). - `tasks/phase-5-evaluation.md` 의 `status: planned` → `completed` 도 같이 (P5-1 + P5-2 PR 머지 후 미반영분). ## 검증 - `grep -rEn "\bkb-[a-z]|\bkb_[a-z]|\.config/kb\b|kb\.sqlite|\bKB_[A-Z]" --include="*.md"` 0 hits (task-decomposition.md 의 git author 제외). - 모든 file path reference 살아있음 (renamed file 들 모두 새 path 로 update). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1356 lines
59 KiB
Markdown
1356 lines
59 KiB
Markdown
# KB Task Decomposition Implementation Plan
|
||
|
||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||
|
||
**Goal:** Decompose the KB project into ~30 component-level task spec files (one self-contained PR/agent unit each) so that AI-driven implementation runs with stable contracts and minimal cross-task spec drift.
|
||
|
||
**Architecture:**
|
||
- Phase A — Author the canonical task spec template (`tasks/_template.md`).
|
||
- Phase B — Decompose P1 (Markdown ingestion) into 6 component task specs to validate the template.
|
||
- Phase C — After Phase B passes review, decompose P0 + P2..P9 into the remaining ~24 task specs in one pass.
|
||
- Each task spec cites only the frozen design doc (`docs/superpowers/specs/2026-04-27-kebab-final-form-design.md`) for types, traits, schema, layout. No new domain types or traits are introduced inside task specs.
|
||
|
||
**Tech Stack:** Plain Markdown documents under `tasks/`. No code changes in this plan — produces specs that AI sub-agents will later implement against.
|
||
|
||
**Frozen contract source:** [docs/superpowers/specs/2026-04-27-kebab-final-form-design.md](../specs/2026-04-27-kebab-final-form-design.md). All task specs reference this. Modifications to the contract require updating that file first, then re-checking dependent task specs.
|
||
|
||
**Phase task index (target file layout):**
|
||
|
||
```
|
||
tasks/
|
||
├── INDEX.md # already exists — to be updated to link component tasks
|
||
├── _template.md # Phase A
|
||
├── p0/
|
||
│ └── p0-1-skeleton.md
|
||
├── p1/
|
||
│ ├── p1-1-source-fs.md
|
||
│ ├── p1-2-parse-md-frontmatter.md
|
||
│ ├── p1-3-parse-md-blocks.md
|
||
│ ├── p1-4-normalize.md
|
||
│ ├── p1-5-chunk.md
|
||
│ └── p1-6-store-sqlite.md
|
||
├── p2/
|
||
│ ├── p2-1-fts-schema.md
|
||
│ └── p2-2-lexical-retriever.md
|
||
├── p3/
|
||
│ ├── p3-1-embedder-trait.md
|
||
│ ├── p3-2-fastembed-adapter.md
|
||
│ ├── p3-3-lancedb-store.md
|
||
│ └── p3-4-hybrid-fusion.md
|
||
├── p4/
|
||
│ ├── p4-1-llm-trait.md
|
||
│ ├── p4-2-ollama-adapter.md
|
||
│ └── p4-3-rag-pipeline.md
|
||
├── p5/
|
||
│ ├── p5-1-golden-fixture-runner.md
|
||
│ └── p5-2-metrics-compare.md
|
||
├── p6/
|
||
│ ├── p6-1-image-extractor-exif.md
|
||
│ ├── p6-2-ocr-adapter.md
|
||
│ └── p6-3-caption-adapter.md
|
||
├── p7/
|
||
│ ├── p7-1-pdf-text-extractor.md
|
||
│ └── p7-2-pdf-page-chunker.md
|
||
├── p8/
|
||
│ ├── p8-1-whisper-adapter.md
|
||
│ └── p8-2-segment-chunker.md
|
||
└── p9/
|
||
├── p9-1-tui-library.md
|
||
├── p9-2-tui-search.md
|
||
├── p9-3-tui-ask.md
|
||
├── p9-4-tui-inspect.md
|
||
└── p9-5-desktop-tauri.md
|
||
```
|
||
|
||
Existing per-phase epic files (`tasks/phase-0-skeleton.md` … `phase-9-ui.md`) stay as epic-level overviews. Component task files under `tasks/p<N>/` are the actual unit-of-work for AI sub-agents.
|
||
|
||
**Acceptance for plan as a whole:**
|
||
- `tasks/_template.md` exists with all required sections.
|
||
- `tasks/p1/*.md` (6 files) exist, each cites the frozen design doc, lists Allowed/Forbidden deps, has self-contained Test plan.
|
||
- `tasks/p0/*.md`, `tasks/p2..p9/*.md` (~24 files) follow the same template.
|
||
- `tasks/INDEX.md` updated to link component tasks under each phase.
|
||
- `cargo` is not run in this plan (no code).
|
||
|
||
---
|
||
|
||
## Phase A — Authoring the task spec template
|
||
|
||
### Task A1: Write `tasks/_template.md`
|
||
|
||
**Files:**
|
||
- Create: `tasks/_template.md`
|
||
|
||
- [ ] **Step 1: Verify the design doc path resolves**
|
||
|
||
```bash
|
||
test -f docs/superpowers/specs/2026-04-27-kebab-final-form-design.md && echo OK
|
||
```
|
||
|
||
Expected: `OK`
|
||
|
||
- [ ] **Step 2: Write the template file**
|
||
|
||
Write the following content verbatim to `tasks/_template.md`:
|
||
|
||
````markdown
|
||
---
|
||
phase: P<N>
|
||
component: <crate-or-module-name>
|
||
task_id: p<N>-<i>
|
||
title: "<Component title>"
|
||
status: planned
|
||
depends_on: [] # other task_ids
|
||
unblocks: [] # other task_ids
|
||
contract_source: ../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
|
||
contract_sections: [] # e.g. [§3.5, §5.5, §7.2]
|
||
---
|
||
|
||
# <task_id> — <Component title>
|
||
|
||
## Goal
|
||
|
||
<One sentence. The user-facing outcome of this task.>
|
||
|
||
## Why now / why this size
|
||
|
||
<One paragraph. Why this is the right unit of work and how it slots into the phase.>
|
||
|
||
## Allowed dependencies
|
||
|
||
- `kebab-core`
|
||
- <other crates per design §8>
|
||
- <external crates with versions>
|
||
|
||
## Forbidden dependencies
|
||
|
||
- <list — every crate banned per design §8 Allowed/Forbidden table>
|
||
|
||
If any item here is needed during implementation, STOP and update the frozen design doc first.
|
||
|
||
## Inputs
|
||
|
||
| input | type | source |
|
||
|-------|------|--------|
|
||
| ... | ... | ... |
|
||
|
||
## Outputs
|
||
|
||
| output | type | downstream consumer |
|
||
|--------|------|---------------------|
|
||
| ... | ... | ... |
|
||
|
||
## Public surface (signatures only — no new types)
|
||
|
||
```rust
|
||
// Cite only types/traits already defined in the frozen design doc.
|
||
// If a new helper is needed, mark it "internal" and keep it crate-private.
|
||
```
|
||
|
||
## Behavior contract
|
||
|
||
- <bullet list of must-hold invariants>
|
||
- <reference to design doc section numbers>
|
||
- <determinism / version recording / error policy>
|
||
|
||
## Storage / wire effects
|
||
|
||
- DB tables touched (read/write)
|
||
- LanceDB tables touched (read/write)
|
||
- Filesystem paths created/read
|
||
- Wire schema objects emitted (must conform to `*.v1`)
|
||
|
||
## Test plan
|
||
|
||
| kind | description | fixture / data |
|
||
|------|-------------|----------------|
|
||
| unit | ... | ... |
|
||
| snapshot | ... (JSON freeze) | `fixtures/...` |
|
||
| contract | trait round-trip | mock impls |
|
||
| integration | end-to-end via `kebab-app` facade | tmp workspace |
|
||
|
||
All tests must run under `cargo test -p <crate>` and not require external network or Ollama unless explicitly stated.
|
||
|
||
## Definition of Done
|
||
|
||
- [ ] `cargo check -p <crate>` passes
|
||
- [ ] `cargo test -p <crate>` passes
|
||
- [ ] No imports outside Allowed dependencies
|
||
- [ ] All emitted wire JSON validates against `docs/wire-schema/v1/<schema>.schema.json` (when applicable)
|
||
- [ ] All record version fields populated per design §9
|
||
- [ ] PR body links the relevant design section numbers
|
||
|
||
## Out of scope
|
||
|
||
- <explicit list — features that other tasks cover>
|
||
- <future-phase work>
|
||
|
||
## Risks / notes
|
||
|
||
- <one paragraph max — known traps, version coupling, perf concerns>
|
||
````
|
||
|
||
- [ ] **Step 3: Verify file exists and is non-trivial**
|
||
|
||
```bash
|
||
test -s tasks/_template.md && wc -l tasks/_template.md
|
||
```
|
||
|
||
Expected: > 50 lines reported.
|
||
|
||
- [ ] **Step 4: Commit**
|
||
|
||
```bash
|
||
git add tasks/_template.md
|
||
git commit -m "tasks: add component task spec template"
|
||
```
|
||
|
||
---
|
||
|
||
## Phase B — P1 (Markdown ingestion) decomposition
|
||
|
||
P1 epic: [tasks/phase-1-markdown-ingestion.md](../../../tasks/phase-1-markdown-ingestion.md). 6 component tasks. Each cites the frozen design doc sections and lists allowed/forbidden deps per design §8.
|
||
|
||
### Task B0: Create P1 directory and update INDEX
|
||
|
||
**Files:**
|
||
- Create: `tasks/p1/` (directory)
|
||
- Modify: `tasks/INDEX.md`
|
||
|
||
- [ ] **Step 1: Create directory**
|
||
|
||
```bash
|
||
mkdir -p tasks/p1
|
||
```
|
||
|
||
- [ ] **Step 2: Append component-task subsection to `tasks/INDEX.md`**
|
||
|
||
Add this section near the end of `tasks/INDEX.md` (just before the "## 모든 task 공통 규약" heading):
|
||
|
||
```markdown
|
||
## Component task decomposition (per phase)
|
||
|
||
각 phase 의 component-level 분해. AI sub-agent 1세션 = 1 task 가 sweet spot.
|
||
|
||
- P1 — [p1/](p1/) — Markdown ingestion 6 components
|
||
- [p1-1 source-fs](p1/p1-1-source-fs.md)
|
||
- [p1-2 parse-md frontmatter](p1/p1-2-parse-md-frontmatter.md)
|
||
- [p1-3 parse-md blocks](p1/p1-3-parse-md-blocks.md)
|
||
- [p1-4 normalize](p1/p1-4-normalize.md)
|
||
- [p1-5 chunk](p1/p1-5-chunk.md)
|
||
- [p1-6 store-sqlite](p1/p1-6-store-sqlite.md)
|
||
```
|
||
|
||
- [ ] **Step 3: Commit**
|
||
|
||
```bash
|
||
git add tasks/p1 tasks/INDEX.md
|
||
git commit -m "tasks: prepare P1 component decomposition skeleton"
|
||
```
|
||
|
||
### Task B1: `p1-1-source-fs.md` (kebab-source-fs)
|
||
|
||
**Files:**
|
||
- Create: `tasks/p1/p1-1-source-fs.md`
|
||
|
||
- [ ] **Step 1: Write the spec**
|
||
|
||
Write the following content to `tasks/p1/p1-1-source-fs.md`:
|
||
|
||
````markdown
|
||
---
|
||
phase: P1
|
||
component: kebab-source-fs
|
||
task_id: p1-1
|
||
title: "Local filesystem source connector"
|
||
status: planned
|
||
depends_on: [p0-1]
|
||
unblocks: [p1-2, p1-3, p1-4, p1-5, p1-6]
|
||
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
|
||
contract_sections: [§3.3, §6.2, §6.6, §7.1, §7.2 SourceConnector, §8]
|
||
---
|
||
|
||
# p1-1 — Local filesystem source connector
|
||
|
||
## Goal
|
||
|
||
Walk the workspace root, apply gitignore-style filters, compute BLAKE3 checksums, and produce `Vec<RawAsset>`.
|
||
|
||
## Why now / why this size
|
||
|
||
`SourceConnector` is the entry point of every ingest. Stable `RawAsset` output unblocks every downstream P1 task (parser, normalize, chunk, store). Small enough to deliver in one PR with full test coverage.
|
||
|
||
## Allowed dependencies
|
||
|
||
- `kebab-core`
|
||
- `kebab-config`
|
||
- `ignore` (gitignore semantics)
|
||
- `blake3`
|
||
- `walkdir`
|
||
- `time`
|
||
- `serde`
|
||
- `thiserror`
|
||
- `tracing`
|
||
|
||
## Forbidden dependencies
|
||
|
||
- `kebab-parse-*`, `kebab-normalize`, `kebab-chunk`, `kebab-store-*`, `kebab-embed*`, `kebab-search`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`
|
||
|
||
## Inputs
|
||
|
||
| input | type | source |
|
||
|-------|------|--------|
|
||
| `SourceScope` | `kebab_core::SourceScope` | `kebab-app` from config |
|
||
| filesystem | `&Path` | OS |
|
||
| `.kebabignore` | text file | workspace root, optional |
|
||
|
||
## Outputs
|
||
|
||
| output | type | downstream consumer |
|
||
|--------|------|---------------------|
|
||
| `Vec<RawAsset>` | `kebab_core::RawAsset` | `kebab-parse-md`, asset writer in `kebab-store-sqlite` (via `kebab-app`) |
|
||
|
||
## Public surface (signatures only — no new types)
|
||
|
||
```rust
|
||
pub struct FsSourceConnector { /* internal */ }
|
||
|
||
impl FsSourceConnector {
|
||
pub fn new(config: &kebab_config::Config) -> anyhow::Result<Self>;
|
||
}
|
||
|
||
impl kebab_core::SourceConnector for FsSourceConnector {
|
||
fn scan(&self, scope: &kebab_core::SourceScope) -> anyhow::Result<Vec<kebab_core::RawAsset>>;
|
||
}
|
||
```
|
||
|
||
## Behavior contract
|
||
|
||
- POSIX-normalize every emitted `workspace_path` (NFC, leading `./` stripped, single `/`).
|
||
- `asset_id` derived per design §4.2 from `blake3(raw bytes)` full hex.
|
||
- `media_type` selected from extension + libmagic-like sniff fallback (`.md` → Markdown, others fall through to `MediaType::Other`).
|
||
- `discovered_at` = current `OffsetDateTime::now_utc()` at scan time.
|
||
- Combine `config.workspace.exclude` ∪ `.kebabignore` for filter (union; ordering does not matter).
|
||
- Symbolic links: follow once, detect cycles via `canonicalize` + visited set.
|
||
- Files larger than `storage.copy_threshold_mb` MB → emit `AssetStorage::Reference { path, sha }` (do not copy bytes here; copying is done by the asset writer task).
|
||
- Idempotent: same input → same `Vec<RawAsset>` (sort by `workspace_path`).
|
||
|
||
## Storage / wire effects
|
||
|
||
- Reads: filesystem under `config.workspace.root`.
|
||
- Writes: nothing. (Asset copy is handled by the asset writer in `kebab-store-sqlite`.)
|
||
|
||
## Test plan
|
||
|
||
| kind | description | fixture / data |
|
||
|------|-------------|----------------|
|
||
| unit | POSIX path normalization | inline cases incl. `./a/b.md`, `a//b.md`, `a/b.md` → identical |
|
||
| unit | blake3 of known bytes matches expected hex | inline |
|
||
| unit | gitignore filter (`*.tmp`, `node_modules/**`) excludes correctly | tmp tree built in test |
|
||
| unit | `.kebabignore` ∪ config exclude works | tmp tree |
|
||
| unit | symlink cycle does not loop | tmp tree with `a -> b -> a` |
|
||
| snapshot | `Vec<RawAsset>` serialized JSON for fixture tree is stable | `fixtures/source-fs/tree-1` |
|
||
| determinism | re-running scan twice produces byte-identical JSON | `fixtures/source-fs/tree-1` |
|
||
|
||
All tests run under `cargo test -p kebab-source-fs` with no network and no model.
|
||
|
||
## Definition of Done
|
||
|
||
- [ ] `cargo check -p kebab-source-fs` passes
|
||
- [ ] `cargo test -p kebab-source-fs` passes
|
||
- [ ] Snapshot test `fixtures/source-fs/tree-1` round-trips deterministically
|
||
- [ ] No imports outside Allowed dependencies (verified via `cargo tree -p kebab-source-fs`)
|
||
- [ ] PR description links to design §3.3, §6.2, §7.2
|
||
|
||
## Out of scope
|
||
|
||
- File watching (P+).
|
||
- Asset copy/reference storage on disk (`kebab-store-sqlite` task p1-6).
|
||
- Non-fs source connectors (HTTP, S3 — P+).
|
||
|
||
## Risks / notes
|
||
|
||
- BLAKE3 of large files (>1 GB) is fast but allocate streaming; do not load whole file in memory.
|
||
- macOS resource forks / `.DS_Store` should be excluded by default.
|
||
````
|
||
|
||
- [ ] **Step 2: Verify file exists**
|
||
|
||
```bash
|
||
test -s tasks/p1/p1-1-source-fs.md && echo OK
|
||
```
|
||
|
||
Expected: `OK`
|
||
|
||
- [ ] **Step 3: Commit**
|
||
|
||
```bash
|
||
git add tasks/p1/p1-1-source-fs.md
|
||
git commit -m "tasks: add p1-1 source-fs component spec"
|
||
```
|
||
|
||
### Task B2: `p1-2-parse-md-frontmatter.md`
|
||
|
||
**Files:**
|
||
- Create: `tasks/p1/p1-2-parse-md-frontmatter.md`
|
||
|
||
- [ ] **Step 1: Write the spec**
|
||
|
||
Write to `tasks/p1/p1-2-parse-md-frontmatter.md`:
|
||
|
||
````markdown
|
||
---
|
||
phase: P1
|
||
component: kebab-parse-md (frontmatter submodule)
|
||
task_id: p1-2
|
||
title: "Markdown frontmatter parsing → Metadata"
|
||
status: planned
|
||
depends_on: [p0-1]
|
||
unblocks: [p1-4]
|
||
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
|
||
contract_sections: [§3.6 Metadata, §0 Q9 frontmatter, §10 errors]
|
||
---
|
||
|
||
# p1-2 — Markdown frontmatter parsing
|
||
|
||
## Goal
|
||
|
||
Parse YAML/TOML frontmatter from Markdown bytes into `kebab_core::Metadata`, with auto-derive defaults and unknown-key preservation in `metadata.user`.
|
||
|
||
## Why now / why this size
|
||
|
||
Frontmatter is small but contractually load-bearing (Q9 spec). Isolating it from block parsing keeps both halves of `kebab-parse-md` simple and lets us reach 100% test coverage on the rules in design §0 Q9.
|
||
|
||
## Allowed dependencies
|
||
|
||
- `kebab-core`
|
||
- `serde`
|
||
- `serde_yaml` (or `yaml-rust2`) for YAML
|
||
- `toml` for TOML
|
||
- `time`
|
||
- `lingua` (lang auto-detect — accept feature-gate if heavy)
|
||
- `thiserror`
|
||
|
||
## Forbidden dependencies
|
||
|
||
- `kebab-store-*`, `kebab-llm*`, `kebab-rag`, `kebab-embed*`, `kebab-search`, `kebab-tui`, `kebab-desktop`, `kebab-source-fs`, `kebab-chunk`, `kebab-normalize`, `pulldown-cmark` (block parser is a sibling task)
|
||
|
||
## Inputs
|
||
|
||
| input | type | source |
|
||
|-------|------|--------|
|
||
| Markdown bytes | `&[u8]` | extractor |
|
||
| body fallbacks | `BodyHints { first_h1: Option<String>, fs_ctime: OffsetDateTime, fs_mtime: OffsetDateTime, fallback_lang: Option<String> }` | caller |
|
||
|
||
## Outputs
|
||
|
||
| output | type | downstream |
|
||
|--------|------|------------|
|
||
| `(Metadata, Option<FrontmatterSpan>, Vec<Warning>)` | tuple | `kebab-normalize` → CanonicalDocument |
|
||
|
||
## Public surface (signatures only — no new types)
|
||
|
||
```rust
|
||
pub fn parse_frontmatter(
|
||
bytes: &[u8],
|
||
hints: &BodyHints,
|
||
) -> anyhow::Result<(kebab_core::Metadata, Option<FrontmatterSpan>, Vec<Warning>)>;
|
||
```
|
||
|
||
`FrontmatterSpan` and `Warning` are crate-internal helpers; if any new public type is needed, STOP and update the frozen design doc first.
|
||
|
||
## Behavior contract
|
||
|
||
- All Metadata fields are optional in input. Missing fields populated per design §0 Q9 derive table:
|
||
- `title` ← first H1 (from `BodyHints.first_h1`) → filename without extension if no H1.
|
||
- `lang` ← lingua auto-detect on first 4 KB of body → fallback `BodyHints.fallback_lang` or `"und"`.
|
||
- `created_at` / `updated_at` ← `BodyHints.fs_ctime` / `fs_mtime` if missing.
|
||
- `source_type` default `markdown`; `trust_level` default `primary`.
|
||
- `aliases`, `tags` default empty.
|
||
- Unknown keys → `metadata.user` (`serde_json::Map`), preserved verbatim, no warning.
|
||
- Unknown enum value (e.g. `trust_level: weird`) → warning + replaced with default; ingest continues.
|
||
- Malformed YAML → frontmatter discarded, body still parsed, warning emitted.
|
||
- No frontmatter at all → defaults applied silently.
|
||
- `id:` field captured into `metadata.user_id_alias` (alias only — does NOT influence `doc_id` per design §4.2).
|
||
|
||
## Storage / wire effects
|
||
|
||
- None. Pure function.
|
||
|
||
## Test plan
|
||
|
||
| kind | description | fixture / data |
|
||
|------|-------------|----------------|
|
||
| unit | YAML frontmatter happy path → Metadata fields | inline |
|
||
| unit | TOML frontmatter happy path | inline |
|
||
| unit | unknown keys preserved in `metadata.user` | inline |
|
||
| unit | unknown enum value → warning + default | inline |
|
||
| unit | malformed YAML → empty Metadata + warning | inline |
|
||
| unit | no frontmatter → derive from BodyHints | inline |
|
||
| unit | `id:` field becomes `user_id_alias`, not `doc_id` factor | inline + assert via §4.2 recipe stub |
|
||
| snapshot | `fixtures/markdown/frontmatter-only.md` produces stable JSON | fixture |
|
||
| snapshot | mixed-language body with no `lang:` detects `ko` or `en` | `fixtures/markdown/mixed-lang.md` |
|
||
|
||
All tests under `cargo test -p kebab-parse-md --lib frontmatter`.
|
||
|
||
## Definition of Done
|
||
|
||
- [ ] `cargo check -p kebab-parse-md` passes
|
||
- [ ] `cargo test -p kebab-parse-md frontmatter` passes
|
||
- [ ] No `pulldown-cmark` import in this submodule
|
||
- [ ] Snapshot tests stable across two consecutive runs
|
||
- [ ] PR links design §0 Q9, §3.6
|
||
|
||
## Out of scope
|
||
|
||
- Block parsing (p1-3).
|
||
- Building `CanonicalDocument` (p1-4).
|
||
- Persisting metadata (p1-6).
|
||
|
||
## Risks / notes
|
||
|
||
- `lingua` model load is heavy on first call; tests should reuse a static instance.
|
||
- timezone normalization: parse `created_at`/`updated_at` to UTC; preserve original offset only in `metadata.user.original_timestamps` if present and non-UTC.
|
||
````
|
||
|
||
- [ ] **Step 2: Verify and commit**
|
||
|
||
```bash
|
||
test -s tasks/p1/p1-2-parse-md-frontmatter.md && echo OK
|
||
git add tasks/p1/p1-2-parse-md-frontmatter.md
|
||
git commit -m "tasks: add p1-2 parse-md frontmatter component spec"
|
||
```
|
||
|
||
Expected: `OK`, then commit succeeds.
|
||
|
||
### Task B3: `p1-3-parse-md-blocks.md`
|
||
|
||
**Files:**
|
||
- Create: `tasks/p1/p1-3-parse-md-blocks.md`
|
||
|
||
- [ ] **Step 1: Write the spec**
|
||
|
||
Write to `tasks/p1/p1-3-parse-md-blocks.md`:
|
||
|
||
````markdown
|
||
---
|
||
phase: P1
|
||
component: kebab-parse-md (blocks submodule)
|
||
task_id: p1-3
|
||
title: "Markdown body → Block tree with line spans"
|
||
status: planned
|
||
depends_on: [p0-1]
|
||
unblocks: [p1-4]
|
||
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
|
||
contract_sections: [§3.4 Block, §3.4 SourceSpan, §0 Q3 citation]
|
||
---
|
||
|
||
# p1-3 — Markdown body → Block tree
|
||
|
||
## Goal
|
||
|
||
Parse Markdown body bytes into a flat `Vec<ParsedBlock>` (intermediate, crate-private) with heading paths and line ranges preserved, ready for `kebab-normalize` to lift into `CanonicalDocument`.
|
||
|
||
## Why now / why this size
|
||
|
||
This is the heaviest part of P1 parser. Separating it from frontmatter and from normalization keeps each piece tractable. Determinism of line ranges directly determines citation quality (design §0 Q3 / §3.4 SourceSpan::Line).
|
||
|
||
## Allowed dependencies
|
||
|
||
- `kebab-core`
|
||
- `pulldown-cmark` (CommonMark with source-map; GFM tables enabled via feature)
|
||
- `serde`
|
||
- `thiserror`
|
||
|
||
## Forbidden dependencies
|
||
|
||
- `kebab-store-*`, `kebab-llm*`, `kebab-rag`, `kebab-embed*`, `kebab-search`, `kebab-source-fs`, `kebab-chunk`, `kebab-normalize`, `kebab-tui`, `kebab-desktop`, `comrak` (alternative parser; pick one)
|
||
|
||
## Inputs
|
||
|
||
| input | type | source |
|
||
|-------|------|--------|
|
||
| Markdown body bytes | `&[u8]` | extractor (after frontmatter stripped) |
|
||
| `body_offset_lines` | `u32` | extractor (so line ranges are reported relative to original file) |
|
||
|
||
## Outputs
|
||
|
||
| output | type | downstream |
|
||
|--------|------|------------|
|
||
| `Vec<ParsedBlock>` (intermediate type, crate-private) | – | `kebab-normalize` |
|
||
| `Vec<Warning>` | – | propagated into Provenance |
|
||
|
||
## Public surface (signatures only — no new types)
|
||
|
||
```rust
|
||
pub fn parse_blocks(body: &[u8], body_offset_lines: u32) -> anyhow::Result<(Vec<ParsedBlock>, Vec<Warning>)>;
|
||
```
|
||
|
||
`ParsedBlock` is a crate-internal mirror that maps 1:1 to `kebab_core::Block` variants once `kebab-normalize` assigns `BlockId`s.
|
||
|
||
## Behavior contract
|
||
|
||
- Source-map: each `ParsedBlock` carries `SourceSpan::Line { start, end }` relative to the original file (i.e., add `body_offset_lines`).
|
||
- Heading tree: every block records its ancestor heading texts in order (e.g., `["아키텍처", "Chunking 정책"]`).
|
||
- Code blocks: language tag preserved (` ```rust ` → `Some("rust")`), fenced content not split.
|
||
- Tables: GFM tables produce `TableBlock` with header row + body rows; if a table cell is malformed, fall back to a `Paragraph` block + warning.
|
||
- Image references: `` produces `ImageRefBlock` with `asset_id = None`, `src = "..."`, `alt = "..."`. Resolution to `AssetId` happens later in `kebab-normalize`.
|
||
- Lists: ordered/unordered preserved; nested list items flattened into one `ListBlock` with each top-level item's text.
|
||
- Inline elements: only `Text`, `Code`, `Link`, `Strong`, `Emph` (per design §3.4). Drop other inlines silently.
|
||
- Malformed input never panics. Worst case: empty `Vec<ParsedBlock>` + warning.
|
||
|
||
## Storage / wire effects
|
||
|
||
- None.
|
||
|
||
## Test plan
|
||
|
||
| kind | description | fixture / data |
|
||
|------|-------------|----------------|
|
||
| unit | heading tree depth + heading_path correctness | inline |
|
||
| unit | code block lang tag preserved | inline |
|
||
| unit | GFM table parses; malformed table degrades to paragraph + warning | inline |
|
||
| unit | line range correct under various line-ending styles (LF / CRLF) | inline |
|
||
| unit | image ref captured with src/alt | inline |
|
||
| unit | nested list flattens correctly | inline |
|
||
| unit | malformed input does not panic | inline (random byte slices) |
|
||
| snapshot | `fixtures/markdown/nested-headings.md` → ParsedBlock JSON stable | fixture |
|
||
| snapshot | `fixtures/markdown/code-and-table.md` → JSON stable | fixture |
|
||
|
||
All tests under `cargo test -p kebab-parse-md --lib blocks`.
|
||
|
||
## Definition of Done
|
||
|
||
- [ ] `cargo check -p kebab-parse-md` passes
|
||
- [ ] `cargo test -p kebab-parse-md blocks` passes
|
||
- [ ] Snapshot tests stable across two runs
|
||
- [ ] No imports outside Allowed dependencies
|
||
- [ ] PR links design §3.4
|
||
|
||
## Out of scope
|
||
|
||
- Frontmatter (p1-2).
|
||
- Lifting `ParsedBlock` → `kebab_core::Block` with `BlockId` (p1-4).
|
||
- Chunking (p1-5).
|
||
|
||
## Risks / notes
|
||
|
||
- `pulldown-cmark` source-map may not include exact byte ranges for all event kinds; line ranges are the binding contract per design (line-range citation is the primary form for Markdown).
|
||
- CRLF normalization: convert internally to LF for span math but report line numbers from the original byte stream.
|
||
````
|
||
|
||
- [ ] **Step 2: Verify and commit**
|
||
|
||
```bash
|
||
test -s tasks/p1/p1-3-parse-md-blocks.md && echo OK
|
||
git add tasks/p1/p1-3-parse-md-blocks.md
|
||
git commit -m "tasks: add p1-3 parse-md blocks component spec"
|
||
```
|
||
|
||
### Task B4: `p1-4-normalize.md`
|
||
|
||
**Files:**
|
||
- Create: `tasks/p1/p1-4-normalize.md`
|
||
|
||
- [ ] **Step 1: Write the spec**
|
||
|
||
Write to `tasks/p1/p1-4-normalize.md`:
|
||
|
||
````markdown
|
||
---
|
||
phase: P1
|
||
component: kebab-normalize
|
||
task_id: p1-4
|
||
title: "Lift parser output → CanonicalDocument with deterministic IDs"
|
||
status: planned
|
||
depends_on: [p1-2, p1-3]
|
||
unblocks: [p1-5, p1-6]
|
||
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
|
||
contract_sections: [§3.4, §4 ID recipe, §3.6 Provenance]
|
||
---
|
||
|
||
# p1-4 — Lift to CanonicalDocument
|
||
|
||
## Goal
|
||
|
||
Combine `Metadata` (p1-2) + `Vec<ParsedBlock>` (p1-3) + `RawAsset` (p1-1) into a `CanonicalDocument` with deterministic `doc_id` and `block_id`s per design §4 recipe.
|
||
|
||
## Why now / why this size
|
||
|
||
Single responsibility: ID generation + struct assembly. Keeps `kebab-parse-md` purely a parser and isolates the (security-critical) deterministic ID logic in one crate.
|
||
|
||
## Allowed dependencies
|
||
|
||
- `kebab-core`
|
||
- `kebab-config`
|
||
- `serde`
|
||
- `serde-json-canonicalizer` (canonical JSON for ID hashing)
|
||
- `blake3`
|
||
- `unicode-normalization` (NFC)
|
||
- `time`
|
||
- `thiserror`
|
||
|
||
## Forbidden dependencies
|
||
|
||
- `kebab-source-fs`, `kebab-parse-md` (consumed via plain types only — must not couple back), `kebab-chunk`, `kebab-store-*`, `kebab-embed*`, `kebab-search`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`
|
||
|
||
Note: this crate accepts `ParsedBlock` from `kebab-parse-md` either by (a) exposing `ParsedBlock` as a `kebab-core` type, or (b) `kebab-parse-md` re-exporting via a public DTO. Pick (a): move `ParsedBlock` into `kebab-core` so this task does not import `kebab-parse-md`.
|
||
|
||
## Inputs
|
||
|
||
| input | type | source |
|
||
|-------|------|--------|
|
||
| `RawAsset` | `kebab_core::RawAsset` | p1-1 |
|
||
| `Metadata` + frontmatter span + warnings | from p1-2 | parser caller |
|
||
| `Vec<ParsedBlock>` + warnings | from p1-3 | parser caller |
|
||
| `parser_version` | `kebab_core::ParserVersion` | constant in `kebab-parse-md` |
|
||
|
||
## Outputs
|
||
|
||
| output | type | downstream |
|
||
|--------|------|------------|
|
||
| `CanonicalDocument` | `kebab_core::CanonicalDocument` | `kebab-chunk`, `kebab-store-sqlite` |
|
||
|
||
## Public surface (signatures only — no new types)
|
||
|
||
```rust
|
||
pub fn build_canonical_document(
|
||
asset: &kebab_core::RawAsset,
|
||
metadata: kebab_core::Metadata,
|
||
blocks: Vec<kebab_core::ParsedBlock>,
|
||
parser_version: &kebab_core::ParserVersion,
|
||
warnings: Vec<Warning>,
|
||
) -> anyhow::Result<kebab_core::CanonicalDocument>;
|
||
|
||
pub fn id_for_doc(workspace_path: &kebab_core::WorkspacePath, asset: &kebab_core::AssetId, parser_version: &kebab_core::ParserVersion) -> kebab_core::DocumentId;
|
||
pub fn id_for_block(doc: &kebab_core::DocumentId, kind: &str, heading_path: &[String], ordinal: u32, span: &kebab_core::SourceSpan) -> kebab_core::BlockId;
|
||
```
|
||
|
||
## Behavior contract
|
||
|
||
- ID generation strictly follows design §4.2 (canonical JSON of tagged tuple, blake3 hex truncated to 32 chars).
|
||
- `block_id` ordinal: per `(heading_path, kind)` group, 0-based, in document order.
|
||
- All input strings normalized to NFC before hashing.
|
||
- POSIX path normalization applied to `workspace_path`.
|
||
- Unicode line endings normalized internally; `SourceSpan::Line` indices preserved as-is from p1-3.
|
||
- `Provenance` built with one event per pipeline stage encountered: `Discovered`, `Parsed`, `Normalized`. Warnings appended as `ProvenanceKind::Warning` with `note`.
|
||
- Determinism property test: same inputs → byte-identical `CanonicalDocument` JSON, including ID stability across runs.
|
||
|
||
## Storage / wire effects
|
||
|
||
- None.
|
||
|
||
## Test plan
|
||
|
||
| kind | description | fixture / data |
|
||
|------|-------------|----------------|
|
||
| unit | id_for_doc deterministic across 1000 runs | inline |
|
||
| unit | NFC vs NFD Korean inputs produce identical IDs | inline |
|
||
| unit | POSIX path with `./` and `//` collapse to same `doc_id` | inline |
|
||
| unit | block ordinal numbering inside same heading_path is correct | inline |
|
||
| unit | provenance contains Discovered/Parsed/Normalized in order | inline |
|
||
| snapshot | `fixtures/markdown/code-and-table.md` → CanonicalDocument JSON stable (incl. all IDs) | fixture |
|
||
|
||
All tests under `cargo test -p kebab-normalize`.
|
||
|
||
## Definition of Done
|
||
|
||
- [ ] `cargo check -p kebab-normalize` passes
|
||
- [ ] `cargo test -p kebab-normalize` passes
|
||
- [ ] Determinism test runs ≥ 1000 iterations under 1 second
|
||
- [ ] No `kebab-parse-md` import (consumed via `kebab-core::ParsedBlock`)
|
||
- [ ] PR links design §4.2, §4.3
|
||
|
||
## Out of scope
|
||
|
||
- Chunking (p1-5).
|
||
- DB writes (p1-6).
|
||
- Block validation beyond what is needed to assign IDs (e.g., we do NOT verify image src exists on disk here).
|
||
|
||
## Risks / notes
|
||
|
||
- If ID recipe changes, all dependent records become stale. Treat any change to `id_for_doc`/`id_for_block` as a `parser_version` bump (design §9).
|
||
````
|
||
|
||
- [ ] **Step 2: Verify and commit**
|
||
|
||
```bash
|
||
test -s tasks/p1/p1-4-normalize.md && echo OK
|
||
git add tasks/p1/p1-4-normalize.md
|
||
git commit -m "tasks: add p1-4 normalize component spec"
|
||
```
|
||
|
||
### Task B5: `p1-5-chunk.md`
|
||
|
||
**Files:**
|
||
- Create: `tasks/p1/p1-5-chunk.md`
|
||
|
||
- [ ] **Step 1: Write the spec**
|
||
|
||
Write to `tasks/p1/p1-5-chunk.md`:
|
||
|
||
````markdown
|
||
---
|
||
phase: P1
|
||
component: kebab-chunk
|
||
task_id: p1-5
|
||
title: "Markdown heading-aware chunker (md-heading-v1)"
|
||
status: planned
|
||
depends_on: [p1-4]
|
||
unblocks: [p1-6, p2-2, p3-2]
|
||
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
|
||
contract_sections: [§3.5 Chunk, §4.2 chunk_id recipe, §7.2 Chunker, §0 Q3 citation]
|
||
---
|
||
|
||
# p1-5 — Markdown heading-aware chunker
|
||
|
||
## Goal
|
||
|
||
Implement `Chunker` trait emitting `chunker_version = "md-heading-v1"`. Block-aware: respect heading boundaries, never split code/table, propagate `heading_path` and merged `source_spans`.
|
||
|
||
## Why now / why this size
|
||
|
||
The first concrete `Chunker`. Establishes how subsequent chunkers (PDF page chunker, audio segment chunker) are scoped: per-medium chunker version label. Independent of any store/embed.
|
||
|
||
## Allowed dependencies
|
||
|
||
- `kebab-core`
|
||
- `kebab-config`
|
||
- `serde`
|
||
- `blake3` (policy_hash)
|
||
- `serde-json-canonicalizer`
|
||
- `thiserror`
|
||
|
||
## Forbidden dependencies
|
||
|
||
- `kebab-source-fs`, `kebab-parse-md`, `kebab-normalize` (consumes `CanonicalDocument` only via `kebab-core`), `kebab-store-*`, `kebab-embed*`, `kebab-search`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`
|
||
|
||
## Inputs
|
||
|
||
| input | type | source |
|
||
|-------|------|--------|
|
||
| `CanonicalDocument` | `kebab_core::CanonicalDocument` | p1-4 |
|
||
| `ChunkPolicy` | `kebab_core::ChunkPolicy` | `kebab-app` from config |
|
||
|
||
## Outputs
|
||
|
||
| output | type | downstream |
|
||
|--------|------|------------|
|
||
| `Vec<Chunk>` | `kebab_core::Chunk` | `kebab-store-sqlite` (p1-6), `kebab-embed*` (P3) |
|
||
|
||
## Public surface (signatures only — no new types)
|
||
|
||
```rust
|
||
pub struct MdHeadingV1Chunker;
|
||
|
||
impl kebab_core::Chunker for MdHeadingV1Chunker {
|
||
fn chunker_version(&self) -> kebab_core::ChunkerVersion;
|
||
fn policy_hash(&self, policy: &kebab_core::ChunkPolicy) -> String;
|
||
fn chunk(&self, doc: &kebab_core::CanonicalDocument, policy: &kebab_core::ChunkPolicy) -> anyhow::Result<Vec<kebab_core::Chunk>>;
|
||
}
|
||
```
|
||
|
||
`policy_hash` = `blake3(canonical_json(policy))` hex truncated to 16 chars.
|
||
|
||
## Behavior contract
|
||
|
||
- Priority order (per design §0 / report §14):
|
||
1. heading boundary first
|
||
2. never split a code block
|
||
3. table stays in a single chunk if possible
|
||
4. long sections split by paragraph
|
||
5. propagate `heading_path` from blocks
|
||
6. carry merged `source_spans` (each chunk lists every contributing block's span)
|
||
7. record `chunker_version = "md-heading-v1"` and `policy_hash`
|
||
- `target_tokens` and `overlap_tokens` from `ChunkPolicy`. Token estimate is byte-based proxy until a real tokenizer is introduced (note in `Chunk.token_estimate`).
|
||
- `chunk_id` per design §4.2: tagged tuple of `(doc_id, chunker_version, block_ids, policy_hash)`.
|
||
- `block_ids` listed in document order (significant — affects ID).
|
||
- ImageRef / AudioRef blocks are emitted as their own chunks (text portion = alt + caption preview if present, else empty string with `token_estimate=0`). They still receive `chunk_id` so future image/audio search can locate them.
|
||
|
||
## Storage / wire effects
|
||
|
||
- None directly. Outputs feed p1-6.
|
||
|
||
## Test plan
|
||
|
||
| kind | description | fixture / data |
|
||
|------|-------------|----------------|
|
||
| unit | heading boundary respected (no chunk crosses H2 → H2) | inline |
|
||
| unit | code block of 800 tokens stays in one chunk even when target=500 | inline |
|
||
| unit | table block stays single chunk if size < 2× target | inline |
|
||
| unit | long paragraph split with overlap_tokens applied | inline |
|
||
| unit | ImageRefBlock produces a chunk with token_estimate=0 | inline |
|
||
| determinism | identical input + identical policy → identical chunk_ids | inline |
|
||
| snapshot | `fixtures/markdown/long-section.md` → Vec<Chunk> JSON stable | fixture |
|
||
|
||
All tests under `cargo test -p kebab-chunk`.
|
||
|
||
## Definition of Done
|
||
|
||
- [ ] `cargo check -p kebab-chunk` passes
|
||
- [ ] `cargo test -p kebab-chunk` passes
|
||
- [ ] Snapshot stable across two runs
|
||
- [ ] No imports outside Allowed dependencies
|
||
- [ ] PR links design §3.5, §4.2
|
||
|
||
## Out of scope
|
||
|
||
- DB persistence (p1-6).
|
||
- Embedding (P3).
|
||
- Reranking / hybrid (P3).
|
||
|
||
## Risks / notes
|
||
|
||
- Token estimate proxy: a real tokenizer (e.g., sentencepiece for the embedding model) replaces this in P3. The proxy must err toward overestimation so chunks fit in real tokenizer budget.
|
||
- Changing `chunker_version` invalidates all downstream embedding records. Bump only with PR documenting the migration plan (design §9).
|
||
````
|
||
|
||
- [ ] **Step 2: Verify and commit**
|
||
|
||
```bash
|
||
test -s tasks/p1/p1-5-chunk.md && echo OK
|
||
git add tasks/p1/p1-5-chunk.md
|
||
git commit -m "tasks: add p1-5 chunk component spec"
|
||
```
|
||
|
||
### Task B6: `p1-6-store-sqlite.md`
|
||
|
||
**Files:**
|
||
- Create: `tasks/p1/p1-6-store-sqlite.md`
|
||
|
||
- [ ] **Step 1: Write the spec**
|
||
|
||
Write to `tasks/p1/p1-6-store-sqlite.md`:
|
||
|
||
````markdown
|
||
---
|
||
phase: P1
|
||
component: kebab-store-sqlite (P1 subset)
|
||
task_id: p1-6
|
||
title: "SQLite store: assets/documents/blocks/chunks + asset writer + migrations"
|
||
status: planned
|
||
depends_on: [p1-1, p1-4, p1-5]
|
||
unblocks: [p2-1, p3-3, p4-3]
|
||
contract_source: ../../docs/superpowers/specs/2026-04-27-kebab-final-form-design.md
|
||
contract_sections: [§5 DDL (5.1, 5.2, 5.3, 5.4, 5.5 chunks only — FTS handled in p2-1), §5.7 jobs/ingest_runs, §5.8 transactions, §6.3 data_dir layout]
|
||
---
|
||
|
||
# p1-6 — SQLite store (P1 subset)
|
||
|
||
## Goal
|
||
|
||
Persist `RawAsset`, `CanonicalDocument`, `Block`s, `Chunk`s into SQLite per design §5; copy raw asset bytes into `data_dir/assets/<aa>/<asset_id>` (or reference if larger than threshold); record an `ingest_runs` row.
|
||
|
||
## Why now / why this size
|
||
|
||
P1's terminal task. Closes the loop `walk → parse → chunk → store`. The FTS5 virtual table and triggers are intentionally deferred to p2-1 to keep this task focused on the relational schema and asset I/O.
|
||
|
||
## Allowed dependencies
|
||
|
||
- `kebab-core`
|
||
- `kebab-config`
|
||
- `rusqlite` (with `bundled-sqlcipher` disabled; use `bundled` feature)
|
||
- `refinery` for migrations
|
||
- `serde_json`
|
||
- `time`
|
||
- `blake3` (asset copy verification)
|
||
- `tracing`
|
||
- `thiserror`
|
||
|
||
## Forbidden dependencies
|
||
|
||
- `kebab-source-fs` (only types via `kebab-core`), `kebab-parse-md`, `kebab-normalize`, `kebab-chunk` (only types via `kebab-core`), `kebab-store-vector`, `kebab-embed*`, `kebab-search`, `kebab-llm*`, `kebab-rag`, `kebab-tui`, `kebab-desktop`
|
||
|
||
## Inputs
|
||
|
||
| input | type | source |
|
||
|-------|------|--------|
|
||
| migrations | `migrations/V001__init.sql` | repo |
|
||
| `RawAsset` + bytes | `(RawAsset, Vec<u8>)` | p1-1 + reader |
|
||
| `CanonicalDocument` | `kebab_core::CanonicalDocument` | p1-4 |
|
||
| `Vec<Chunk>` | `kebab_core::Chunk` | p1-5 |
|
||
| `IngestRun` aggregates | `(scope, counts, duration)` | `kebab-app` |
|
||
|
||
## Outputs
|
||
|
||
| output | type | downstream |
|
||
|--------|------|------------|
|
||
| `data_dir/kebab.sqlite` rows in `assets`, `documents`, `blocks`, `chunks`, `document_tags`, `ingest_runs`, `jobs`, `schema_meta`, `migrations` | – | every later phase |
|
||
| `data_dir/assets/<aa>/<asset_id>` bytes (when copied) | – | future re-extraction, integrity verification |
|
||
| `IngestReport` (wire schema v1) | `kebab_core::IngestReport` | `kebab-cli`, eval |
|
||
|
||
## Public surface (signatures only — no new types)
|
||
|
||
```rust
|
||
pub struct SqliteStore { /* internal */ }
|
||
|
||
impl SqliteStore {
|
||
pub fn open(config: &kebab_config::Config) -> anyhow::Result<Self>;
|
||
pub fn run_migrations(&self) -> anyhow::Result<()>;
|
||
|
||
pub fn put_asset_with_bytes(&self, asset: &kebab_core::RawAsset, bytes: &[u8]) -> anyhow::Result<()>;
|
||
}
|
||
|
||
impl kebab_core::DocumentStore for SqliteStore {
|
||
fn put_asset(&self, a: &kebab_core::RawAsset) -> anyhow::Result<()>;
|
||
fn put_document(&self, d: &kebab_core::CanonicalDocument) -> anyhow::Result<()>;
|
||
fn put_blocks(&self, doc: &kebab_core::DocumentId, blocks: &[kebab_core::Block]) -> anyhow::Result<()>;
|
||
fn put_chunks(&self, doc: &kebab_core::DocumentId, chunks: &[kebab_core::Chunk]) -> anyhow::Result<()>;
|
||
fn get_document(&self, id: &kebab_core::DocumentId) -> anyhow::Result<Option<kebab_core::CanonicalDocument>>;
|
||
fn get_chunk(&self, id: &kebab_core::ChunkId) -> anyhow::Result<Option<kebab_core::Chunk>>;
|
||
fn list_documents(&self, filter: &kebab_core::DocFilter) -> anyhow::Result<Vec<kebab_core::DocSummary>>;
|
||
}
|
||
|
||
impl kebab_core::JobRepo for SqliteStore { /* per design §7.2 signatures */ }
|
||
```
|
||
|
||
## Behavior contract
|
||
|
||
- DDL: `migrations/V001__init.sql` ships exactly the SQL in design §5.1, §5.2, §5.3, §5.4, §5.5 (chunks table only — FTS table & triggers come in p2-1 as `V002`), §5.7 jobs/ingest_runs/answers/eval_runs/eval_query_results, §5.6 embedding_records.
|
||
- Pragmas at open: `foreign_keys=ON`, `journal_mode=WAL`, `synchronous=NORMAL`, `temp_store=MEMORY`.
|
||
- One ingest of one document = one transaction (BEGIN..COMMIT). Partial failures roll back; warnings are not failures.
|
||
- Bulk ingest commits per-document.
|
||
- Asset writer:
|
||
- if `asset.byte_len <= storage.copy_threshold_mb * 1_048_576`: write bytes to `assets_dir/<asset_id[..2]>/<asset_id>` (mode 0o644), record `storage_kind='copied'`.
|
||
- else: do not copy; record `storage_kind='reference'` with `storage_path = asset.source_uri`'s file path.
|
||
- In either case, recompute `blake3` of the source bytes once on write/verify and store in `assets.checksum`. Mismatch → return `StoreError::Conflict`.
|
||
- Idempotency: re-ingesting the same `(workspace_path, asset_id, parser_version)` updates `documents.updated_at`, increments `doc_version`, replaces blocks/chunks. No row duplication.
|
||
- `document_tags`: re-derived from `Metadata.tags` on each put.
|
||
- `ingest_runs.items_json` is null when caller passes `summary_only=true`.
|
||
- All wire JSON returned (`IngestReport`) conforms to `docs/wire-schema/v1/ingest_report.schema.json`. Fail loudly if schema not present (caller must vendor it).
|
||
|
||
## Storage / wire effects
|
||
|
||
- Writes: `kebab.sqlite` (multiple tables), `data_dir/assets/<aa>/<asset_id>` (copied case).
|
||
- Reads on subsequent calls: same DB.
|
||
|
||
## Test plan
|
||
|
||
| kind | description | fixture / data |
|
||
|------|-------------|----------------|
|
||
| migration | fresh DB after `run_migrations` has all P1 tables and indexes | tmp dir |
|
||
| unit | put_asset_with_bytes copy mode writes file with correct mode and bytes | tmp dir |
|
||
| unit | put_asset_with_bytes reference mode does not write file but records path | tmp dir + large fake size |
|
||
| unit | checksum mismatch returns Conflict error | tmp dir + tampered bytes |
|
||
| unit | put_document idempotency: same input twice → 1 row, doc_version bumped | tmp dir |
|
||
| unit | put_blocks + put_chunks transactional rollback on simulated failure | tmp dir |
|
||
| contract | DocumentStore trait round-trip for fixture document | `fixtures/markdown/code-and-table.md` |
|
||
| snapshot | IngestReport JSON for fixture run | fixture |
|
||
|
||
All tests under `cargo test -p kebab-store-sqlite` with no network.
|
||
|
||
## Definition of Done
|
||
|
||
- [ ] `cargo check -p kebab-store-sqlite` passes
|
||
- [ ] `cargo test -p kebab-store-sqlite` passes
|
||
- [ ] migration `V001__init.sql` matches design §5 verbatim (diff-checked in CI)
|
||
- [ ] Writes to `~/.local/share/kebab/` are gated by `kebab-config`'s `data_dir` and never escape it
|
||
- [ ] No imports outside Allowed dependencies
|
||
- [ ] PR links design §5
|
||
|
||
## Out of scope
|
||
|
||
- FTS5 virtual table and triggers (p2-1).
|
||
- Vector store (p3-3).
|
||
- Embedding records writer (p3-2).
|
||
- Search queries (p2-2).
|
||
|
||
## Risks / notes
|
||
|
||
- WAL mode requires careful test cleanup: tests must drop the connection before removing `kebab.sqlite-wal` / `-shm`.
|
||
- Asset directory shard prefix uses `asset_id[..2]`; using `asset_id[..1]` would create at most 16 dirs (insufficient).
|
||
````
|
||
|
||
- [ ] **Step 2: Verify and commit**
|
||
|
||
```bash
|
||
test -s tasks/p1/p1-6-store-sqlite.md && echo OK
|
||
git add tasks/p1/p1-6-store-sqlite.md
|
||
git commit -m "tasks: add p1-6 store-sqlite component spec"
|
||
```
|
||
|
||
### Task B7: Validate Phase B output
|
||
|
||
- [ ] **Step 1: Confirm 6 P1 specs exist and reference the design doc**
|
||
|
||
```bash
|
||
ls tasks/p1/ | sort
|
||
for f in tasks/p1/p1-*.md; do grep -q '2026-04-27-kebab-final-form-design.md' "$f" || echo "MISSING REF in $f"; done
|
||
echo done
|
||
```
|
||
|
||
Expected: lists `p1-1-source-fs.md` … `p1-6-store-sqlite.md`, no `MISSING REF` lines, ends with `done`.
|
||
|
||
- [ ] **Step 2: Confirm Allowed/Forbidden sections present in every spec**
|
||
|
||
```bash
|
||
for f in tasks/p1/p1-*.md; do
|
||
grep -q '^## Allowed dependencies' "$f" || echo "no Allowed in $f"
|
||
grep -q '^## Forbidden dependencies' "$f" || echo "no Forbidden in $f"
|
||
done
|
||
echo done
|
||
```
|
||
|
||
Expected: only `done` printed.
|
||
|
||
- [ ] **Step 3: Pause for user review**
|
||
|
||
Stop here. Wait for the user to skim `tasks/p1/*.md` and approve before Phase C kicks off. Phase C reuses Phase B's template shape, so a template-shape correction is cheaper now than after 24 more files.
|
||
|
||
---
|
||
|
||
## Phase C — Decompose remaining phases (P0, P2..P9)
|
||
|
||
For each component task below, the steps are: (1) write file, (2) verify, (3) commit. Body content follows the same template skeleton as Phase B (frontmatter + Goal + Why + Allowed/Forbidden + Inputs + Outputs + Public surface + Behavior contract + Storage/wire + Test plan + DoD + Out of scope + Risks). Each task **must** cite the listed `contract_sections` of the frozen design doc.
|
||
|
||
### Task C1: `tasks/p0/p0-1-skeleton.md`
|
||
|
||
**Files:** Create: `tasks/p0/p0-1-skeleton.md`. Also `mkdir -p tasks/p0`.
|
||
|
||
`contract_sections`: §3 (all subsections), §4, §5 (migrations meta only), §6, §7, §8, §10. `Allowed`: workspace + `kebab-core` + `kebab-config` + `kebab-app` + `kebab-cli` only.
|
||
|
||
Body covers: workspace `Cargo.toml` resolver=3, edition 2024, member list (`kebab-core`, `kebab-config`, `kebab-app`, `kebab-cli`), workspace dependencies, `kebab-core` types and traits per design §3 / §7, deterministic ID functions per §4 (with full unit tests), `kebab-config` loader (TOML + env + CLI override per §6.4), `kebab-app` facade signatures (`ingest`, `search`, `ask`, `inspect_doc`, `inspect_chunk`, `doctor`, `init`), `kebab-cli` skeleton with clap + `--help`. DoD: `cargo check --workspace`, `cargo test --workspace` (Newtype+ID+canonical-json tests only), `kebab --help` works, `docs/spec/*` stubs created (link to frozen design doc), `docs/wire-schema/v1/*.schema.json` stubs (one file per object in §2).
|
||
|
||
- [ ] **Step 1: Create directory and file**
|
||
|
||
```bash
|
||
mkdir -p tasks/p0
|
||
```
|
||
|
||
- [ ] **Step 2: Write the spec using the Phase B template, with the contents described above**
|
||
|
||
Use the Phase B Task B1 (`p1-1-source-fs.md`) file as the structural template. Replace fields: `phase: P0`, `task_id: p0-1`, `component: workspace`, `Allowed` and `Forbidden` per §8, `Inputs/Outputs` per §6/§7, `Public surface` listing every type and trait from §3 and §7, `Behavior contract` covering ID determinism + facade boundary, `Test plan` running ID determinism (1000 iters), Newtype Display/FromStr round-trip, canonical_json snapshot. `Out of scope`: anything that needs another crate added.
|
||
|
||
- [ ] **Step 3: Verify and commit**
|
||
|
||
```bash
|
||
test -s tasks/p0/p0-1-skeleton.md && echo OK
|
||
git add tasks/p0 tasks/INDEX.md
|
||
git -c user.name=kb -c user.email=kb@local commit -m "tasks: add p0-1 skeleton component spec"
|
||
```
|
||
|
||
### Task C2: `tasks/p2/` (P2 — 2 specs)
|
||
|
||
**Files:** Create: `tasks/p2/p2-1-fts-schema.md`, `tasks/p2/p2-2-lexical-retriever.md`. `mkdir -p tasks/p2`.
|
||
|
||
#### `p2-1-fts-schema.md`
|
||
|
||
`contract_sections`: §5.5 FTS5 + triggers, §9 versioning. `Allowed`: `kebab-core`, `kebab-config`, `kebab-store-sqlite` (extends migrations). `depends_on: [p1-6]`. Migration `V002__fts.sql` adds `chunks_fts` virtual table and three triggers verbatim from §5.5. Tests: backfill from existing chunks via `INSERT INTO chunks_fts SELECT ... FROM chunks`, then assert FTS row count == chunks row count; insert/update/delete in `chunks` reflects in `chunks_fts`.
|
||
|
||
#### `p2-2-lexical-retriever.md`
|
||
|
||
`contract_sections`: §3.7 SearchQuery/Hit, §0 Q3 citation (URI fragment), §1.5 search output (for snippet length defaults), §2.2 wire schema. `Allowed`: `kebab-core`, `kebab-config`, `kebab-store-sqlite`. `depends_on: [p2-1]`. Implements `Retriever` trait with `bm25(chunks_fts)` ranking, snippet via SQLite `snippet()` (≤ `snippet_chars` chars), citation built per §0 Q3 from `source_spans`. Tests: top-k correctness on fixture corpus, citation line range round-trip against original Markdown, deterministic across two runs.
|
||
|
||
- [ ] **Step 1: Create directory and both spec files** (template per Phase B; bodies as described above).
|
||
- [ ] **Step 2: Verify with `for f in tasks/p2/p2-*.md; do test -s "$f" || echo MISSING $f; done; echo done` (expect only `done`).**
|
||
- [ ] **Step 3: Commit**
|
||
|
||
```bash
|
||
git add tasks/p2 && git -c user.name=kb -c user.email=kb@local commit -m "tasks: add P2 component specs (fts-schema, lexical-retriever)"
|
||
```
|
||
|
||
### Task C3: `tasks/p3/` (P3 — 4 specs)
|
||
|
||
**Files:** `mkdir -p tasks/p3` and create `p3-1-embedder-trait.md`, `p3-2-fastembed-adapter.md`, `p3-3-lancedb-store.md`, `p3-4-hybrid-fusion.md`.
|
||
|
||
- `p3-1-embedder-trait.md`: §3.7, §7.2 Embedder, §11. Allowed: `kebab-core`, `kebab-config`. No external embedding dep. Public surface: `Embedder` trait + `EmbeddingInput`/`EmbeddingKind` already in core (validate they exist; if not, this task is also a `kebab-core` patch). `depends_on: [p0-1]`. Tests: trait dyn dispatch, mock embedder.
|
||
- `p3-2-fastembed-adapter.md`: §11.3, §6.4 `[models.embedding]`. Allowed: `kebab-core`, `kebab-config`, `fastembed`, `tokenizers`, `ort`. `depends_on: [p3-1]`. Provides `FastembedEmbedder` implementing `Embedder` for `multilingual-e5-small` (default), with required Document/Query prefix per §11.3. Tests: dimension check, deterministic vector for fixed input (hash compare on first 8 floats with epsilon), batch size respected.
|
||
- `p3-3-lancedb-store.md`: §3.5, §5.6 embedding_records, §6.3 lancedb table naming. Allowed: `kebab-core`, `kebab-config`, `lancedb`, `arrow`, `kebab-store-sqlite` (write `embedding_records` row only — no other table). `depends_on: [p3-2, p1-6]`. Implements `VectorStore` trait. Table naming `chunk_embeddings_<model>_<dim>.lance`. `ensure_table` creates if missing. `upsert` inserts vectors and writes a matching `embedding_records` row in same logical operation (best-effort 2PC: lance commit, then SQLite insert; on SQLite failure, log warning + leave lance row — re-upsert is idempotent because of the `UNIQUE(chunk_id, model_id, model_version, dimensions)` constraint and lance upsert semantics). `search` filters via SearchFilters and returns top-k. Tests: smoke (insert+search), dimension mismatch error, model isolation (two models stay in two tables).
|
||
- `p3-4-hybrid-fusion.md`: §3.7 RetrievalDetail, §0 Q3, §1.6 search --explain, §6.4 `[search]` rrf settings. Allowed: `kebab-core`, `kebab-config`, `kebab-store-sqlite` (lexical Retriever from p2-2), `kebab-store-vector` (vector Retriever wrapper around `VectorStore::search`). `depends_on: [p2-2, p3-3]`. Implements `HybridRetriever` that dispatches by `SearchMode`, fuses with RRF (k from config, default 60), populates `lexical_score`, `vector_score`, `lexical_rank`, `vector_rank`, `fusion_score`. Tests: pure lexical mode == p2-2 output; pure vector mode == p3-3 output; hybrid produces strictly larger or equal coverage of expected hits than either single mode on a small fixture; deterministic.
|
||
|
||
- [ ] **Step 1: Create directory and 4 files** (template per Phase B).
|
||
- [ ] **Step 2: Verify**
|
||
|
||
```bash
|
||
for f in tasks/p3/p3-*.md; do test -s "$f" || echo MISSING $f; done; echo done
|
||
```
|
||
|
||
Expect only `done`.
|
||
|
||
- [ ] **Step 3: Commit**
|
||
|
||
```bash
|
||
git add tasks/p3 && git -c user.name=kb -c user.email=kb@local commit -m "tasks: add P3 component specs (embedder, fastembed, lancedb, hybrid)"
|
||
```
|
||
|
||
### Task C4: `tasks/p4/` (P4 — 3 specs)
|
||
|
||
**Files:** `mkdir -p tasks/p4` and create `p4-1-llm-trait.md`, `p4-2-ollama-adapter.md`, `p4-3-rag-pipeline.md`.
|
||
|
||
- `p4-1-llm-trait.md`: §7.2 LanguageModel + TokenChunk, §0 Q5 streaming, §3.8 Answer types referenced. Allowed: `kebab-core`, `kebab-config`. `depends_on: [p0-1]`. Defines (or validates) `LanguageModel` trait, `GenerateRequest`, `TokenChunk`, `FinishReason`, `TokenUsage` per design. Tests: trait dyn dispatch, mock LM streams 3 tokens.
|
||
- `p4-2-ollama-adapter.md`: §11.2 Ollama, §6.4 `[models.llm]`, §0 Q5 streaming. Allowed: `kebab-core`, `kebab-config`, `reqwest` (blocking + json + stream feature) or `ureq` + manual SSE; `serde_json`, `tokio`/runtime if needed. `depends_on: [p4-1]`. Implements `OllamaLanguageModel` with streaming `/api/generate`. `temperature=0.0` default, `seed` honored for determinism. Reachability/missing-model errors map to `LlmError` per design §10. Tests: against a mock HTTP server (`wiremock` or hand-rolled `tiny_http`); deterministic stream collect equals buffered concatenation; missing model returns `LlmError::ModelNotPulled` with proper hint.
|
||
- `p4-3-rag-pipeline.md`: §0 Q4 refusal (two-layer), §0 Q7 footer, §1.1–1.4 ask scenes, §2.3 Answer wire, §3.8 internal Answer, §6.4 `[rag]`. Allowed: `kebab-core`, `kebab-config`, `kebab-search` (Retriever), `kebab-llm` (LanguageModel). `depends_on: [p3-4, p4-2]`. Pipeline: retrieve top-k → score gate (`refusal_reason: ScoreGate` if top1 < gate) → context packer (token budget + heading_path header `[#n doc=… heading=… span=…]`) → render `rag-v1` prompt → stream → collect → citation extraction (regex `\[(\d+)\]`) → citation validation (each `[n]` must map to a packed chunk; otherwise `grounded=false`, `refusal_reason: LlmSelfJudge`) → write `answers` row. Tests: happy path produces grounded Answer with citations; query with all chunks below gate produces ScoreGate refusal; query whose LLM emits a citation pointing to non-existent `[7]` becomes LlmSelfJudge refusal; identical query under temperature=0 produces byte-identical Answer (snapshot).
|
||
|
||
- [ ] **Step 1, 2, 3** as in C3.
|
||
|
||
```bash
|
||
mkdir -p tasks/p4
|
||
# write three files
|
||
for f in tasks/p4/p4-*.md; do test -s "$f" || echo MISSING $f; done; echo done
|
||
git add tasks/p4 && git -c user.name=kb -c user.email=kb@local commit -m "tasks: add P4 component specs (llm-trait, ollama, rag-pipeline)"
|
||
```
|
||
|
||
### Task C5: `tasks/p5/` (P5 — 2 specs)
|
||
|
||
**Files:** `mkdir -p tasks/p5` and create `p5-1-golden-fixture-runner.md`, `p5-2-metrics-compare.md`.
|
||
|
||
- `p5-1-golden-fixture-runner.md`: phase epic + §5.7 eval_runs/eval_query_results, §6.3 runs_dir. Allowed: `kebab-core`, `kebab-config`, `kebab-app` (calls facade for search/ask), `serde_yaml`. `depends_on: [p4-3]`. Loads `fixtures/golden_queries.yaml`, runs each query in selected mode (lexical/vector/hybrid/rag), captures per-query results to `eval_query_results` and to `runs_dir/<run_id>/per_query.jsonl`. Tests: fixture with 3 queries runs end-to-end on a tiny corpus, all rows recorded.
|
||
- `p5-2-metrics-compare.md`: phase epic, §0 Q6 wire schema. Allowed: `kebab-core`, `kebab-config`, `kebab-store-sqlite` (read eval rows). `depends_on: [p5-1]`. Computes hit@k, MRR, recall@k_doc, citation_coverage, groundedness (rule-based via `must_contain`), empty_result_rate, refusal_correctness. `kebab eval compare a b` produces wins/losses/draws + delta. Tests: fixed input rows produce expected metric values; compare produces stable sorted output.
|
||
|
||
- [ ] **Step 1, 2, 3** as in C3.
|
||
|
||
```bash
|
||
mkdir -p tasks/p5
|
||
# write two files
|
||
for f in tasks/p5/p5-*.md; do test -s "$f" || echo MISSING $f; done; echo done
|
||
git add tasks/p5 && git -c user.name=kb -c user.email=kb@local commit -m "tasks: add P5 component specs (runner, metrics)"
|
||
```
|
||
|
||
### Task C6: `tasks/p6/` (P6 — 3 specs)
|
||
|
||
`mkdir -p tasks/p6` and create:
|
||
|
||
- `p6-1-image-extractor-exif.md`: phase epic §9.1, §3.4 ImageRefBlock, §3.7a ImageType. Allowed: `kebab-core`, `kebab-config`, `image`, `kamadak-exif`. Implements `Extractor` for `MediaType::Image(_)` producing a `CanonicalDocument` whose body is exactly one `ImageRefBlock`. EXIF goes to `metadata.user`. `depends_on: [p0-1, p1-6]`. Tests: PNG/JPEG decode metadata; EXIF extraction; deterministic doc_id.
|
||
- `p6-2-ocr-adapter.md`: phase epic §9.1. Allowed: `kebab-core`, `kebab-config`, `image`, OS-specific OCR (feature `apple-vision` for macOS via sidecar binary; feature `tesseract` for cross-platform; default tesseract). Defines `OcrEngine` trait + adapter. Populates `ImageRefBlock.ocr` `OcrText` (`joined`, regions, engine, engine_version). `depends_on: [p6-1]`. Tests: deterministic text on a fixed fixture image with high-confidence text.
|
||
- `p6-3-caption-adapter.md`: phase epic §9.1 caption section, §3.7a ModelCaption. Allowed: `kebab-core`, `kebab-config`, `kebab-llm` (reuse LanguageModel for VLM). Optional/feature-gated. `depends_on: [p6-1, p4-2]`. Populates `ImageRefBlock.caption`. Tests: with mock LM, caption recorded with model id; absence of feature flag leaves caption=None.
|
||
|
||
- [ ] **Step 1, 2, 3** as in C3.
|
||
|
||
```bash
|
||
mkdir -p tasks/p6
|
||
# write three files
|
||
for f in tasks/p6/p6-*.md; do test -s "$f" || echo MISSING $f; done; echo done
|
||
git add tasks/p6 && git -c user.name=kb -c user.email=kb@local commit -m "tasks: add P6 component specs (image-exif, ocr, caption)"
|
||
```
|
||
|
||
### Task C7: `tasks/p7/` (P7 — 2 specs)
|
||
|
||
`mkdir -p tasks/p7` and create:
|
||
|
||
- `p7-1-pdf-text-extractor.md`: phase epic §9.2, §3.4 SourceSpan::Page. Allowed: `kebab-core`, `kebab-config`, `pdf-extract`, `lopdf` (page metadata). `depends_on: [p0-1, p1-6]`. Extractor for `MediaType::Pdf` produces a `CanonicalDocument` with one `Paragraph` per page, `SourceSpan::Page`. Failed-text pages are emitted as paragraphs with empty text and a `Provenance` warning marking them as scanned candidates. Tests: page count, span correctness, failure handling.
|
||
- `p7-2-pdf-page-chunker.md`: phase epic §9.2, §3.5, §0 Q3 citation. Allowed: `kebab-core`, `kebab-config`. New chunker version `pdf-page-v1` that respects page boundaries. `depends_on: [p7-1]`. Tests: chunk does not cross page boundary; very long page subdivides per `target_tokens`.
|
||
|
||
- [ ] **Step 1, 2, 3** as in C3.
|
||
|
||
```bash
|
||
mkdir -p tasks/p7
|
||
# write two files
|
||
for f in tasks/p7/p7-*.md; do test -s "$f" || echo MISSING $f; done; echo done
|
||
git add tasks/p7 && git -c user.name=kb -c user.email=kb@local commit -m "tasks: add P7 component specs (pdf-extractor, pdf-chunker)"
|
||
```
|
||
|
||
### Task C8: `tasks/p8/` (P8 — 2 specs)
|
||
|
||
`mkdir -p tasks/p8` and create:
|
||
|
||
- `p8-1-whisper-adapter.md`: phase epic §9.3, §3.4 AudioRefBlock + `Transcript`. Allowed: `kebab-core`, `kebab-config`, whisper.cpp Rust binding (`whisper-rs`) or sidecar binary. `depends_on: [p0-1, p1-6]`. Implements `Transcriber` trait. Default model `large-v3` via config; tests use a tiny model (e.g., `base.en`) for speed. Tests: monotone segment timestamps, language detection populated, deterministic transcript on fixed audio.
|
||
- `p8-2-segment-chunker.md`: phase epic §9.3, §3.5. New `audio-segment-v1` chunker that groups segments up to `target_tokens` with priority on speaker turn boundaries (when present). `depends_on: [p8-1]`. Tests: chunk timestamp == first/last segment timestamp; speaker change forces split.
|
||
|
||
- [ ] **Step 1, 2, 3** as in C3.
|
||
|
||
```bash
|
||
mkdir -p tasks/p8
|
||
# write two files
|
||
for f in tasks/p8/p8-*.md; do test -s "$f" || echo MISSING $f; done; echo done
|
||
git add tasks/p8 && git -c user.name=kb -c user.email=kb@local commit -m "tasks: add P8 component specs (whisper, audio-chunker)"
|
||
```
|
||
|
||
### Task C9: `tasks/p9/` (P9 — 5 specs)
|
||
|
||
`mkdir -p tasks/p9` and create:
|
||
|
||
- `p9-1-tui-library.md`: phase epic §16.2, §3.7. Allowed: `kebab-core`, `kebab-app` only (UI law). `ratatui`, `crossterm`. `depends_on: [p1-6]`. Library list view + tag filter. Tests: snapshot of rendered frame against fixture corpus list.
|
||
- `p9-2-tui-search.md`: phase epic §16.2, §1.5. Allowed: same as p9-1. `depends_on: [p2-2, p3-4]`. Search input + result list + preview pane; `Enter` triggers external editor jump (`$EDITOR +<line> <path>`). Tests: search results render; `g` keybinding constructs the correct editor command.
|
||
- `p9-3-tui-ask.md`: phase epic §16.2, §1.1, §1.2. Allowed: same. `depends_on: [p4-3]`. Ask pane shows streaming tokens; `--explain` toggle. Tests: streaming render, refusal render.
|
||
- `p9-4-tui-inspect.md`: §1.6 inspect, §3.5. Allowed: same. `depends_on: [p1-6, p3-3]`. Renders Document and Chunk inspection per wire schemas 2.5/2.6.
|
||
- `p9-5-desktop-tauri.md`: phase epic §16.3, §1 all scenes. Allowed: `kebab-core`, `kebab-app`, Tauri backend; frontend stack TBD by user (vanilla TS by default). `depends_on: [p9-1, p9-2, p9-3, p9-4]`. Backend exposes Tauri commands that wrap `kebab-app` 1:1. Source viewer per medium (Markdown render, PDF page, image with region overlay, audio with seek). Tests: backend command unit tests (no frontend e2e in this task).
|
||
|
||
- [ ] **Step 1, 2, 3** as in C3.
|
||
|
||
```bash
|
||
mkdir -p tasks/p9
|
||
# write five files
|
||
for f in tasks/p9/p9-*.md; do test -s "$f" || echo MISSING $f; done; echo done
|
||
git add tasks/p9 && git -c user.name=kb -c user.email=kb@local commit -m "tasks: add P9 component specs (tui x4, desktop)"
|
||
```
|
||
|
||
### Task C10: Final INDEX update
|
||
|
||
**Files:** Modify: `tasks/INDEX.md` — extend the "Component task decomposition" subsection added in B0 to list every phase.
|
||
|
||
- [ ] **Step 1: Replace the subsection added in B0 with the full list**
|
||
|
||
The new subsection reads:
|
||
|
||
```markdown
|
||
## Component task decomposition (per phase)
|
||
|
||
각 phase 의 component-level 분해. AI sub-agent 1세션 = 1 task 가 sweet spot.
|
||
|
||
- P0 — [p0/](p0/) — 1 component
|
||
- [p0-1 skeleton](p0/p0-1-skeleton.md)
|
||
- P1 — [p1/](p1/) — 6 components
|
||
- [p1-1 source-fs](p1/p1-1-source-fs.md)
|
||
- [p1-2 parse-md frontmatter](p1/p1-2-parse-md-frontmatter.md)
|
||
- [p1-3 parse-md blocks](p1/p1-3-parse-md-blocks.md)
|
||
- [p1-4 normalize](p1/p1-4-normalize.md)
|
||
- [p1-5 chunk](p1/p1-5-chunk.md)
|
||
- [p1-6 store-sqlite](p1/p1-6-store-sqlite.md)
|
||
- P2 — [p2/](p2/) — 2 components
|
||
- [p2-1 fts-schema](p2/p2-1-fts-schema.md)
|
||
- [p2-2 lexical-retriever](p2/p2-2-lexical-retriever.md)
|
||
- P3 — [p3/](p3/) — 4 components
|
||
- [p3-1 embedder-trait](p3/p3-1-embedder-trait.md)
|
||
- [p3-2 fastembed-adapter](p3/p3-2-fastembed-adapter.md)
|
||
- [p3-3 lancedb-store](p3/p3-3-lancedb-store.md)
|
||
- [p3-4 hybrid-fusion](p3/p3-4-hybrid-fusion.md)
|
||
- P4 — [p4/](p4/) — 3 components
|
||
- [p4-1 llm-trait](p4/p4-1-llm-trait.md)
|
||
- [p4-2 ollama-adapter](p4/p4-2-ollama-adapter.md)
|
||
- [p4-3 rag-pipeline](p4/p4-3-rag-pipeline.md)
|
||
- P5 — [p5/](p5/) — 2 components
|
||
- [p5-1 golden-fixture-runner](p5/p5-1-golden-fixture-runner.md)
|
||
- [p5-2 metrics-compare](p5/p5-2-metrics-compare.md)
|
||
- P6 — [p6/](p6/) — 3 components
|
||
- [p6-1 image-extractor-exif](p6/p6-1-image-extractor-exif.md)
|
||
- [p6-2 ocr-adapter](p6/p6-2-ocr-adapter.md)
|
||
- [p6-3 caption-adapter](p6/p6-3-caption-adapter.md)
|
||
- P7 — [p7/](p7/) — 2 components
|
||
- [p7-1 pdf-text-extractor](p7/p7-1-pdf-text-extractor.md)
|
||
- [p7-2 pdf-page-chunker](p7/p7-2-pdf-page-chunker.md)
|
||
- P8 — [p8/](p8/) — 2 components
|
||
- [p8-1 whisper-adapter](p8/p8-1-whisper-adapter.md)
|
||
- [p8-2 segment-chunker](p8/p8-2-segment-chunker.md)
|
||
- P9 — [p9/](p9/) — 5 components
|
||
- [p9-1 tui-library](p9/p9-1-tui-library.md)
|
||
- [p9-2 tui-search](p9/p9-2-tui-search.md)
|
||
- [p9-3 tui-ask](p9/p9-3-tui-ask.md)
|
||
- [p9-4 tui-inspect](p9/p9-4-tui-inspect.md)
|
||
- [p9-5 desktop-tauri](p9/p9-5-desktop-tauri.md)
|
||
```
|
||
|
||
- [ ] **Step 2: Verify total count**
|
||
|
||
```bash
|
||
ls tasks/p?/p*-*.md | wc -l
|
||
```
|
||
|
||
Expected: `30` (= 1 + 6 + 2 + 4 + 3 + 2 + 3 + 2 + 2 + 5).
|
||
|
||
- [ ] **Step 3: Commit**
|
||
|
||
```bash
|
||
git add tasks/INDEX.md
|
||
git -c user.name=kb -c user.email=kb@local commit -m "tasks: update INDEX with full component-task tree (30 specs)"
|
||
```
|
||
|
||
---
|
||
|
||
## Final acceptance
|
||
|
||
- [ ] `tasks/_template.md` exists.
|
||
- [ ] `tasks/p0/`, `tasks/p1/` … `tasks/p9/` exist with the component spec files listed above.
|
||
- [ ] Every component spec contains: frontmatter (with `contract_sections`), Allowed/Forbidden, Inputs, Outputs, Public surface, Behavior contract, Storage/wire effects, Test plan, Definition of Done, Out of scope, Risks/notes.
|
||
- [ ] `tasks/INDEX.md` lists every component task.
|
||
- [ ] No new domain types introduced inside any component spec — every type referenced is defined in [docs/superpowers/specs/2026-04-27-kebab-final-form-design.md](../specs/2026-04-27-kebab-final-form-design.md).
|
||
- [ ] All commits authored sequentially per task; rollback is per-task.
|