delta-backup/PLAN.md
mikael-lovqvists-claude-agent 8d1d1241b6 Add file attribute tracking TODO to PLAN.md
Document planned approach for capturing and restoring file metadata
(permissions, mtime, uid/gid) alongside content deltas. Notes the need
for a pluggable attribute backend due to filesystem differences, and the
requirement to fail loudly on incompatible attributes during restore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 02:07:09 +00:00


delta-backup — Planning Document

Concept

A CLI tool for space-efficient directory backups using binary deltas. Instead of storing full snapshots each run, it stores the difference between the previous and current state, making backup storage grow proportionally to what actually changed.

Directory Roles

Name     Purpose
SOURCE   Live data, possibly remote (e.g. rsync-accessible path)
PREV     Last known good state — the base for delta generation
PEND     Working area — assembled current state before diffing
DELTAS   Stored deltas + manifests + state tracking

Full Run Sequence

  1. Clear PEND — remove all contents
  2. rsync PREV → PEND — seed locally (fast)
  3. rsync SOURCE → PEND — apply remote changes (only diffs travel over the wire)
  4. Generate delta — parse rsync itemize output to get change list, produce per-file deltas + manifest
  5. Commit delta — write to DELTAS atomically
  6. Promote PEND → PREV — swap working area to become new base

Safety / State Machine

Sequence numbers (not timestamps) identify each delta. A state.json in DELTAS tracks progress:

{ "next_seq": 5, "last_complete": 4 }

Phase transitions are written to state.json so an aborted run can be detected and recovered.

Atomic commit strategy:

  1. Write delta files to DELTAS/tmp/N/
  2. Rename DELTAS/tmp/N/ → DELTAS/N/ (atomic on same filesystem)
  3. Promote PEND → PREV
  4. Update state.json

The presence of a fully-renamed DELTAS/N/ directory is the canonical "delta committed" marker. state.json is a recoverable cache — it can be reconstructed by scanning DELTAS.

Recovery rules:

  • DELTAS/N/ exists but last_complete is N-1 → finish promotion, update state
  • state.json missing → reconstruct from directory scan

Change Detection

No directory walk needed. rsync SOURCE→PEND is run with --itemize-changes, producing a machine-readable list of exactly what changed. Output is captured (not streamed) and parsed:

rsync prefix   Meaning
>f+++++++++    New file
>f.st......    Modified file (any combination of change flags)
*deleting      Deleted file
cd+++++++++    New directory (ignored for delta purposes)

Lines starting with >f or *deleting are extracted. The path is the remainder after the 11-character itemize code + space. This becomes the change list fed directly into delta generation — no separate directory walk required.
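
A minimal sketch of that parser, assuming modern rsync's 11-character itemize code padded with a trailing space (so the path starts at column 12); the function name is illustrative.

```javascript
// Hypothetical itemize parser: extracts the change list from captured
// rsync --itemize-changes output.
function parseItemize(output) {
  const changes = [];
  for (const line of output.split('\n')) {
    if (line.startsWith('*deleting')) {
      changes.push({ path: line.slice(12), status: 'deleted' });
    } else if (line.startsWith('>f')) {
      // >f+++++++++ is a new file; any other >f code is a modification
      const status = line.startsWith('>f+') ? 'added' : 'modified';
      changes.push({ path: line.slice(12), status });
    }
    // cd......... (directories) and all other codes are ignored
  }
  return changes;
}
```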

Delta Format

Pluggable backend interface with two operations:

backend.createDelta(prevFile, newFile, outFile)  // spawn process, no shell strings
backend.applyDelta(prevFile, deltaFile, outFile) // spawn process, no shell strings

Default backend: zstd

  • Modified files: zstd --patch-from=prev new -o out.zst
  • New files: zstd new -o out.zst (no base)
  • Deleted files: manifest entry only, no delta file

Planned backends: xdelta3, bsdiff

Manifest Format

Each delta DELTAS/N/ contains:

  • manifest.json — lists all changed files with their status (added/modified/deleted) and metadata
  • files/ — per-file deltas or compressed blobs, named by numeric index (see Known Limitations)

{
  "seq": 5,
  "timestamp": "2026-03-07T12:00:00Z",
  "prev_seq": 4,
  "backend": "zstd",
  "changes": [
    { "path": "src/main.js",     "status": "modified", "delta": "files/0.zst" },
    { "path": "assets/logo.png", "status": "added",    "delta": "files/1.zst" },
    { "path": "old/thing.txt",   "status": "deleted" }
  ]
}

CLI Interface

delta-backup [options] <command>

Commands:
  run       Full backup run
  status    Show current state (sequences, last run, pending recovery)
  restore   Apply deltas to reconstruct a point in time (future)

Options:
  --source <path>      SOURCE directory (required)
  --prev <path>        PREV directory (required)
  --pend <path>        PEND directory (required)
  --deltas <path>      DELTAS directory (required)
  --backend <name>     Delta backend: zstd (default), xdelta3
  --dry-run            Print what would happen, execute nothing
  --config <file>      Load options from JSON config file (flags override)

Guards: refuse to run if any required path is supplied by neither the flags nor the config file. Never fall back to CWD or implicit defaults for directories — explicit is safer.
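
A sketch of that merge-and-guard logic, assuming the config file parses to a plain object and flags are already parsed; the function name and error wording are illustrative.

```javascript
// Hypothetical option resolution: flags override config, and every
// required directory must come from one of the two. No CWD fallback.
const REQUIRED = ['source', 'prev', 'pend', 'deltas'];

function resolveOptions(flags, fileConfig) {
  const merged = { backend: 'zstd', ...fileConfig, ...flags }; // flags win
  const missing = REQUIRED.filter(key => !merged[key]);
  if (missing.length) {
    throw new Error('missing required option(s): ' + missing.join(', '));
  }
  return merged;
}
```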

Process Spawning

All external tools (rsync, zstd, xdelta3) are spawned with explicit argument arrays. No shell string interpolation ever. Use Node's child_process.spawn or similar.

Planned: Operation Abstractions

Currently dry-run logic is scattered inline throughout the run command. The intent is to refactor toward self-describing operation objects — each operation knows both how to describe itself (for dry-run) and how to execute itself. This makes the run command a clean sequence of operations, makes per-tool behavior easy to adjust (e.g. rsync exit code handling), and makes dry-run output a natural consequence of the abstraction rather than duplicated conditional logic.

Sketch:

// Each tool gets its own operation type
const op = rsyncOp({ args: [...], allowedExitCodes: [0, 24] });
op.describe(); // prints what it would do
await op.run(); // executes

// Run command becomes:
const ops = buildOps(config);
if (dry) ops.forEach(op => op.describe());
else     for (const op of ops) await op.run();

Per-tool exit code handling (e.g. rsync's partial transfer codes) lives inside the operation, not scattered across callers.

Current: rsync Exit Code Handling

rsync meaningful exit codes:

  • 0 — success
  • 23 — partial transfer due to error (fatal)
  • 24 — partial transfer due to vanished source files (acceptable in some cases)

Currently basic: any non-zero exit code throws. Finer-grained handling planned as part of the operation abstraction refactor.

Known Limitations

Delta file naming

Delta files are named by numeric index (e.g. 0.zst, 1.zst) rather than by path. The manifest maps each index to its source path. Path-based naming was considered but rejected because:

  • Deep directory trees can exceed filesystem filename length limits
  • Path separator substitution (e.g. / → __) is ambiguous for filenames that already contain that sequence

Cross-file deduplication

Per-file deltas cannot exploit similarity between different files — each file is compressed/diffed in isolation. Identical or near-identical files in different locations get no benefit from each other. Approaches that could address this:

  • zstd --train to build a shared dictionary from the corpus, then compress all deltas against it
  • Content-addressed storage (deduplicate at the block or file level before delta generation)
  • Tar the entire PEND tree and delta against the previous tar (single-stream, cross-file repetition is visible to the compressor — but random access for restore becomes harder)

These are significant complexity increases and out of scope for now.

File attribute tracking (TODO)

Currently the manifest records only file content changes. File metadata (permissions, mtime, ownership, xattrs) is not tracked, meaning restore cannot faithfully reconstruct the original state.

Planned approach:

  • On each run, compare attributes between PREV and PEND for every file in the change list
  • Encode attribute changes explicitly in the manifest alongside content changes
  • Restore walks the delta chain applying both content deltas and attribute deltas in sequence

Design considerations:

  • fs.stat() gives mode, mtime, uid, gid — but not xattrs, ACLs, or fs-specific attributes
  • Attribute richness is highly filesystem-dependent (ext4, btrfs, APFS, NTFS all differ)
  • Need a pluggable attribute backend, similar to the delta backend, so the attribute set captured and restored can be tuned per deployment without changing core logic
  • Restore must handle the case where an attribute from an older delta is no longer representable on the target filesystem (e.g. restoring to a different fs type) — fail loudly rather than silently skip
  • rsync -a already preserves attributes into PEND, so PEND is always the authoritative source of truth for what attributes should be at that point in time

Occasional Snapshots

Delta chains are space-efficient but fragile: reconstructing a point in time requires replaying every delta back to the start, so one lost or corrupt delta breaks everything after it. Periodic full snapshots (every N deltas, or on demand) bound the reconstruction blast radius. Snapshot support is planned but not in scope for the initial implementation.

Implementation Phases

  1. Phase 1 (now): Arg parsing, config, dry-run, guards, rsync steps
  2. Phase 2: Delta generation with zstd backend, manifest writing, atomic commit
  3. Phase 3: PREV promotion, state.json management, recovery logic
  4. Phase 4: status and restore commands
  5. Future: Additional backends, snapshot support, scheduling