# delta-backup — Planning Document

## Concept

A CLI tool for space-efficient directory backups using binary deltas. Instead of storing full snapshots each run, it stores the *difference* between the previous and current state, so backup storage grows in proportion to what actually changed.

## Directory Roles

| Name | Purpose |
|--------|---------|
| SOURCE | Live data, possibly remote (e.g. an rsync-accessible path) |
| PREV | Last known good state — the base for delta generation |
| PEND | Working area — assembled current state before diffing |
| DELTAS | Stored deltas + manifests + state tracking |

## Full Run Sequence

1. **Clear PEND** — remove all contents
2. **rsync PREV → PEND** — seed locally (fast)
3. **rsync SOURCE → PEND** — apply remote changes (only diffs travel over the wire)
4. **Generate delta** — parse rsync itemize output to get the change list, produce per-file deltas + a manifest
5. **Commit delta** — write to DELTAS atomically
6. **Promote PEND → PREV** — swap the working area to become the new base

## Safety / State Machine

Sequence numbers (not timestamps) identify each delta. A `state.json` in DELTAS tracks progress:

```json
{ "next_seq": 5, "last_complete": 4 }
```

Phase transitions are written to state.json so an aborted run can be detected and recovered.

**Atomic commit strategy:**

1. Write delta files to `DELTAS/tmp/N/`
2. Rename `DELTAS/tmp/N/` → `DELTAS/N/` (atomic on the same filesystem)
3. Promote PEND → PREV
4. Update state.json

The presence of a fully-renamed `DELTAS/N/` directory is the canonical "delta committed" marker. state.json is a recoverable cache — it can be reconstructed by scanning DELTAS.

**Recovery rules:**

- `DELTAS/N/` exists but `last_complete` is N−1 → finish promotion, update state
- state.json missing → reconstruct from a directory scan

## Change Detection

No directory walk is needed. rsync SOURCE → PEND is run with `--itemize-changes`, producing a machine-readable list of exactly what changed.
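Extracting the change list from that output could look roughly like this — a minimal sketch, assuming rsync's standard 11-character itemize code followed by a single space (`parseItemized` is a hypothetical helper, not part of the codebase yet):

```js
// Hypothetical sketch: turn captured `rsync --itemize-changes` output
// into a change list. Assumes the 11-char itemize code + one space;
// `*deleting` lines are padded, so we trim the remainder.
function parseItemized(output) {
  const changes = [];
  for (const line of output.split("\n")) {
    if (line.startsWith("*deleting")) {
      changes.push({
        path: line.slice("*deleting".length).trimStart(),
        status: "deleted",
      });
    } else if (line.startsWith(">f")) {
      const code = line.slice(0, 11); // 11-character itemize code
      const path = line.slice(12);    // remainder after code + space
      // All-'+' flags mean a newly transferred file; anything else is a change
      const status = code.slice(2) === "+++++++++" ? "added" : "modified";
      changes.push({ path, status });
    }
    // Directory lines (`cd...`) and everything else are ignored
  }
  return changes;
}
```

The prefix meanings and the extraction rule this sketch relies on are detailed next.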
Output is captured (not streamed) and parsed:

| rsync prefix | Meaning |
|-------------|----------|
| `>f+++++++++` | New file |
| `>f.st......` | Modified file (any combination of change flags) |
| `*deleting` | Deleted file |
| `cd+++++++++` | New directory (ignored for delta purposes) |

Lines starting with `>f` or `*deleting` are extracted. The path is the remainder after the 11-character itemize code plus a space. This becomes the change list fed directly into delta generation — no separate directory walk required.

## Delta Format

Pluggable backend interface with two operations:

```js
backend.createDelta(prevFile, newFile, outFile)   // spawn process, no shell strings
backend.applyDelta(prevFile, deltaFile, outFile)  // spawn process, no shell strings
```

**Default backend: zstd**

- Modified files: `zstd --patch-from=prev new -o out.zst`
- New files: `zstd new -o out.zst` (no base)
- Deleted files: manifest entry only, no delta file

**Planned backends:** xdelta3, bsdiff

## Manifest Format

Each delta `DELTAS/N/` contains:

- `manifest.json` — lists all changed files with their status (added/modified/deleted) and metadata
- `files/` — per-file deltas or compressed blobs

```json
{
  "seq": 5,
  "timestamp": "2026-03-07T12:00:00Z",
  "prev_seq": 4,
  "backend": "zstd",
  "changes": [
    { "path": "src/main.js", "status": "modified", "delta": "files/0.zst" },
    { "path": "assets/logo.png", "status": "added", "delta": "files/1.zst" },
    { "path": "old/thing.txt", "status": "deleted" }
  ]
}
```

## CLI Interface

```
delta-backup <command> [options]

Commands:
  run        Full backup run
  status     Show current state (sequences, last run, pending recovery)
  restore    Apply deltas to reconstruct a point in time (future)

Options:
  --source   SOURCE directory (required)
  --prev     PREV directory (required)
  --pend     PEND directory (required)
  --deltas   DELTAS directory (required)
  --backend  Delta backend: zstd (default), xdelta3
  --dry-run  Print what would happen, execute nothing
  --config   Load options from JSON config file (flags override)
```

Guards: refuse to run if any required path is missing from both args and config. Never fall back to CWD or implicit defaults for directories — explicit is safer.

## Process Spawning

All external tools (rsync, zstd, xdelta3) are spawned with explicit argument arrays. No shell string interpolation, ever. Use Node's `child_process.spawn` or similar.

### Planned: Operation Abstractions

Currently dry-run logic is scattered inline throughout the run command. The intent is to refactor toward self-describing operation objects — each operation knows both how to describe itself (for dry-run) and how to execute itself. This makes the run command a clean sequence of operations, makes per-tool behavior easy to adjust (e.g. rsync exit code handling), and makes dry-run output a natural consequence of the abstraction rather than duplicated conditional logic.

Sketch:

```js
// Each tool gets its own operation type
const op = rsyncOp({ args: [...], allowedExitCodes: [0, 24] });
op.describe();  // prints what it would do
await op.run(); // executes

// Run command becomes:
const ops = buildOps(config);
if (dry) ops.forEach(op => op.describe());
else for (const op of ops) await op.run();
```

Per-tool exit code handling (e.g. rsync's partial transfer codes) lives inside the operation, not scattered across callers.

### Current: rsync Exit Code Handling

Meaningful rsync exit codes:

- `0` — success
- `23` — partial transfer due to error (fatal)
- `24` — partial transfer due to vanished source files (acceptable in some cases)

Currently basic: any non-zero exit code throws. Finer-grained handling is planned as part of the operation abstraction refactor.

## Known Limitations

### Delta file naming

Delta files are named by numeric index (e.g. `0.zst`, `1.zst`) rather than by path. The manifest maps each index to its source path.
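The index assignment is simple enough to sketch — a hypothetical `assignDeltaNames` helper, assuming the manifest `changes` shape shown earlier (deleted entries get a manifest record but no delta file):

```js
// Hypothetical sketch: give each change that carries a delta payload
// a sequential numeric file name under files/.
function assignDeltaNames(changes) {
  let i = 0;
  return changes.map(c =>
    c.status === "deleted"
      ? { ...c }                               // manifest entry only
      : { ...c, delta: `files/${i++}.zst` }    // 0.zst, 1.zst, ...
  );
}
```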
Path-based naming was considered but rejected because:

- Deep directory trees can exceed filesystem filename length limits
- Path separator substitution (e.g. `/` → `__`) is ambiguous for filenames that already contain that sequence

### Cross-file deduplication

Per-file deltas cannot exploit similarity between different files — each file is compressed/diffed in isolation. Identical or near-identical files in different locations get no benefit from each other.

Approaches that could address this:

- `zstd --train` to build a shared dictionary from the corpus, then compress all deltas against it
- Content-addressed storage (deduplicate at the block or file level before delta generation)
- Tar the entire PEND tree and delta against the previous tar (single-stream, so cross-file repetition is visible to the compressor — but random access for restore becomes harder)

These are significant complexity increases and out of scope for now.

### File attribute tracking (TODO)

Currently the manifest records only file content changes. File metadata (permissions, mtime, ownership, xattrs) is not tracked, meaning restore cannot faithfully reconstruct the original state.

**Planned approach:**

- On each run, compare attributes between PREV and PEND for every file in the change list
- Encode attribute changes explicitly in the manifest alongside content changes
- Restore walks the delta chain, applying both content deltas and attribute deltas in sequence

**Design considerations:**

- `fs.stat()` gives mode, mtime, uid, gid — but not xattrs, ACLs, or fs-specific attributes
- Attribute richness is highly filesystem-dependent (ext4, btrfs, APFS, and NTFS all differ)
- Need a pluggable attribute backend, similar to the delta backend, so the attribute set captured and restored can be tuned per deployment without changing core logic
- Restore must handle the case where an attribute from an older delta is no longer representable on the target filesystem (e.g. restoring to a different fs type) — fail loudly rather than silently skip
- rsync `-a` already preserves attributes into PEND, so PEND is always the authoritative source of truth for what the attributes should be at that point in time

## Occasional Snapshots

Delta chains are space-efficient but fragile when they grow long: restoring a point in time means replaying every delta since the start, and one corrupt delta breaks everything after it. Periodic full snapshots (every N deltas, or on demand) bound the reconstruction blast radius. Snapshot support is planned but not in scope for the initial implementation.

## Implementation Phases

1. **Phase 1 (now):** Arg parsing, config, dry-run, guards, rsync steps
2. **Phase 2:** Delta generation with zstd backend, manifest writing, atomic commit
3. **Phase 3:** PREV promotion, state.json management, recovery logic
4. **Phase 4:** `status` and `restore` commands
5. **Future:** Additional backends, snapshot support, scheduling