delta-backup — Planning Document
Concept
A CLI tool for space-efficient directory backups using binary deltas. Instead of storing full snapshots each run, it stores the difference between the previous and current state, making backup storage grow proportionally to what actually changed.
Directory Roles
| Name | Purpose |
|---|---|
| SOURCE | Live data, possibly remote (e.g. rsync-accessible path) |
| PREV | Last known good state — the base for delta generation |
| PEND | Working area — assembled current state before diffing |
| DELTAS | Stored deltas + manifests + state tracking |
Full Run Sequence
- Clear PEND — remove all contents
- rsync PREV → PEND — seed locally (fast)
- rsync SOURCE → PEND — apply remote changes (only diffs travel over the wire)
- Generate delta — parse rsync itemize output to get change list, produce per-file deltas + manifest
- Commit delta — write to DELTAS atomically
- Promote PEND → PREV — swap working area to become new base
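The two rsync steps above can be sketched as argument-array builders. This is a hypothetical sketch, not code from the repo: it assumes archive mode plus `--delete` so PEND exactly mirrors each source in turn, and `--itemize-changes` only on the SOURCE pass (where the change list is harvested).

```javascript
// Hypothetical helpers: build rsync argument arrays for the two sync steps.
// Trailing slashes make rsync copy directory *contents*, not the directory itself.
function seedArgs(prev, pend) {
  // PREV → PEND: local seed, no itemize output needed
  return ["-a", "--delete", prev + "/", pend + "/"];
}

function syncArgs(source, pend) {
  // SOURCE → PEND: remote pass, itemized so the change list can be parsed
  return ["-a", "--delete", "--itemize-changes", source + "/", pend + "/"];
}
```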
Safety / State Machine
Sequence numbers (not timestamps) identify each delta. A state.json in DELTAS tracks progress:
{ "next_seq": 5, "last_complete": 4 }
Phase transitions are written to state.json so an aborted run can be detected and recovered.
Atomic commit strategy:
- Write delta files to DELTAS/tmp/N/
- Rename DELTAS/tmp/N/ → DELTAS/N/ (atomic on same filesystem)
- Promote PEND → PREV
- Update state.json
The presence of a fully-renamed DELTAS/N/ directory is the canonical "delta committed" marker.
State.json is a recoverable cache — can be reconstructed by scanning DELTAS.
Recovery rules:
- DELTAS/N/ exists but last_complete is N-1 → finish promotion, update state
- state.json missing → reconstruct from directory scan
Change Detection
No directory walk needed. rsync SOURCE→PEND is run with --itemize-changes, producing a
machine-readable list of exactly what changed. Output is captured (not streamed) and parsed:
| rsync prefix | Meaning |
|---|---|
| >f+++++++++ | New file |
| >f.st...... | Modified file (any combination of change flags) |
| *deleting | Deleted file |
| cd+++++++++ | New directory (ignored for delta purposes) |
Lines starting with >f or *deleting are extracted. The path is the remainder after the
11-character itemize code + space. This becomes the change list fed directly into delta generation
— no separate directory walk required.
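The extraction rule described above (take lines starting with >f or *deleting, path begins after the 11-character code plus a space) could be sketched as a pure parser. `parseItemize` is a hypothetical helper name, not code from the repo:

```javascript
// Hypothetical sketch: turn rsync --itemize-changes output into a change list.
function parseItemize(output) {
  const changes = [];
  for (const line of output.split("\n")) {
    // Path starts after the 11-character itemize code plus one space.
    const path = line.slice(12);
    if (line.startsWith(">f")) {
      // An all-plus flag run means a brand-new file; anything else is a modification.
      const status = line.startsWith(">f+++++++++") ? "added" : "modified";
      changes.push({ path, status });
    } else if (line.startsWith("*deleting")) {
      changes.push({ path, status: "deleted" });
    }
    // cd+++++++++ (new directory) and other codes are ignored on purpose.
  }
  return changes;
}
```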
Delta Format
Pluggable backend interface with two operations:
backend.createDelta(prevFile, newFile, outFile) // spawn process, no shell strings
backend.applyDelta(prevFile, deltaFile, outFile) // spawn process, no shell strings
Default backend: zstd
- Modified files: zstd --patch-from=prev new -o out.zst
- New files: zstd new -o out.zst (no base)
- Deleted files: manifest entry only, no delta file
Planned backends: xdelta3, bsdiff
Manifest Format
Each delta DELTAS/N/ contains:
- manifest.json — lists all changed files with their status (added/modified/deleted) and metadata
- files/ — per-file deltas or compressed blobs
{
"seq": 5,
"timestamp": "2026-03-07T12:00:00Z",
"prev_seq": 4,
"backend": "zstd",
"changes": [
{ "path": "src/main.js", "status": "modified", "delta": "files/0.zst" },
{ "path": "assets/logo.png", "status": "added", "delta": "files/1.zst" },
{ "path": "old/thing.txt", "status": "deleted" }
]
}
CLI Interface
delta-backup [options] <command>
Commands:
run Full backup run
status Show current state (sequences, last run, pending recovery)
restore Apply deltas to reconstruct a point in time (future)
Options:
--source <path> SOURCE directory (required)
--prev <path> PREV directory (required)
--pend <path> PEND directory (required)
--deltas <path> DELTAS directory (required)
--backend <name> Delta backend: zstd (default), xdelta3
--dry-run Print what would happen, execute nothing
--config <file> Load options from JSON config file (flags override)
Guards: refuse to run if any required path is missing from args AND config. Never fall back to CWD or implicit defaults for directories — explicit is safer.
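The guard described above could be sketched as a small resolver (`resolveOptions` and `REQUIRED` are hypothetical names): config-file values are merged first, flags override them, and anything still missing aborts the run rather than falling back to a default.

```javascript
// Hypothetical sketch of the guard: flags override config, and every
// required directory must be set somewhere; there are no implicit defaults.
const REQUIRED = ["source", "prev", "pend", "deltas"];

function resolveOptions(configFile = {}, flags = {}) {
  const opts = { ...configFile, ...flags }; // flags win over config
  const missing = REQUIRED.filter(key => !opts[key]);
  if (missing.length) {
    throw new Error(
      `Missing required path(s): ${missing.join(", ")} (no implicit defaults)`
    );
  }
  return opts;
}
```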
Process Spawning
All external tools (rsync, zstd, xdelta3) are spawned with explicit argument arrays.
No shell string interpolation ever. Use Node's child_process.spawn or similar.
Planned: Operation Abstractions
Currently dry-run logic is scattered inline throughout the run command. The intent is to refactor toward self-describing operation objects — each operation knows both how to describe itself (for dry-run) and how to execute itself. This makes the run command a clean sequence of operations, makes per-tool behavior easy to adjust (e.g. rsync exit code handling), and makes dry-run output a natural consequence of the abstraction rather than duplicated conditional logic.
Sketch:
// Each tool gets its own operation type
const op = rsyncOp({ args: [...], allowedExitCodes: [0, 24] });
op.describe(); // prints what it would do
await op.run(); // executes
// Run command becomes:
const ops = buildOps(config);
if (dry) ops.forEach(op => op.describe());
else for (const op of ops) await op.run();
Per-tool exit code handling (e.g. rsync's partial transfer codes) lives inside the operation, not scattered across callers.
Current: rsync Exit Code Handling
rsync meaningful exit codes:
- 0 — success
- 23 — partial transfer due to error (fatal)
- 24 — partial transfer due to vanished source files (acceptable in some cases)
Currently basic: any non-zero exit code throws. Finer-grained handling planned as part of the operation abstraction refactor.
Known Limitations
Delta file naming
Delta files are named by numeric index (e.g. 0.zst, 1.zst) rather than by path. The manifest
maps each index to its source path. Path-based naming was considered but rejected because:
- Deep directory trees can exceed filesystem filename length limits
- Path separator substitution (e.g. / → __) is ambiguous for filenames containing that sequence
Cross-file deduplication
Per-file deltas cannot exploit similarity between different files — each file is compressed/diffed in isolation. Identical or near-identical files in different locations get no benefit from each other. Approaches that could address this:
- zstd --train to build a shared dictionary from the corpus, then compress all deltas against it
- Content-addressed storage (deduplicate at the block or file level before delta generation)
- Tar the entire PEND tree and delta against the previous tar (single-stream, cross-file repetition is visible to the compressor — but random access for restore becomes harder)
These are significant complexity increases and out of scope for now.
File attribute tracking (TODO)
Currently the manifest records only file content changes. File metadata (permissions, mtime, ownership, xattrs) is not tracked, meaning restore cannot faithfully reconstruct the original state.
Planned approach:
- On each run, compare attributes between PREV and PEND for every file in the change list
- Encode attribute changes explicitly in the manifest alongside content changes
- Restore walks the delta chain applying both content deltas and attribute deltas in sequence
Design considerations:
- fs.stat() gives mode, mtime, uid, gid — but not xattrs, ACLs, or fs-specific attributes
- Attribute richness is highly filesystem-dependent (ext4, btrfs, APFS, NTFS all differ)
- Need a pluggable attribute backend, similar to the delta backend, so the attribute set captured and restored can be tuned per deployment without changing core logic
- Restore must handle the case where an attribute from an older delta is no longer representable on the target filesystem (e.g. restoring to a different fs type) — fail loudly rather than silently skip
- rsync -a already preserves attributes into PEND, so PEND is always the authoritative source of truth for what attributes should be at that point in time
Occasional Snapshots
Delta chains are efficient but fragile over long chains. Periodic full snapshots (every N deltas, or on demand) bound the reconstruction blast radius. Snapshot support is planned but not in scope for initial implementation.
Implementation Phases
- Phase 1 (now): Arg parsing, config, dry-run, guards, rsync steps
- Phase 2: Delta generation with zstd backend, manifest writing, atomic commit
- Phase 3: PREV promotion, state.json management, recovery logic
- Phase 4: status and restore commands
- Future: Additional backends, snapshot support, scheduling