# delta-backup: Planning Document

## Concept

A CLI tool for space-efficient directory backups using binary deltas. Instead of storing a full
snapshot on each run, it stores the *difference* between the previous and current state, so
backup storage grows in proportion to what actually changed.

## Directory Roles

| Name   | Purpose |
|--------|---------|
| SOURCE | Live data, possibly remote (e.g. rsync-accessible path) |
| PREV   | Last known good state; the base for delta generation |
| PEND   | Working area; assembled current state before diffing |
| DELTAS | Stored deltas + manifests + state tracking |

## Full Run Sequence

1. **Clear PEND**: remove all contents
2. **rsync PREV → PEND**: seed locally (fast)
3. **rsync SOURCE → PEND**: apply remote changes (only diffs travel over the wire)
4. **Generate delta**: parse rsync itemize output to get the change list, then produce per-file deltas and a manifest
5. **Commit delta**: write to DELTAS atomically
6. **Promote PEND → PREV**: swap the working area in as the new base

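Steps 2 and 3 above could be expressed as argument arrays along the following lines. This is a minimal sketch, not the tool's actual code: `buildRsyncSteps` and `withSlash` are hypothetical names, and the `-a` and `--delete` flags are assumptions (the plan itself only names `--itemize-changes`). `--delete` appears on the SOURCE step because rsync only reports `*deleting` lines when deletion is enabled.

```js
// Hypothetical helper: a trailing slash makes rsync copy the directory's
// contents rather than the directory itself.
function withSlash(dir) {
  return dir.endsWith("/") ? dir : dir + "/";
}

// Hypothetical helper: returns the two rsync invocations as argument arrays.
// The flags are assumptions for illustration, not the plan's final choice.
function buildRsyncSteps({ source, prev, pend }) {
  return [
    {
      name: "seed",
      command: "rsync",
      args: ["-a", withSlash(prev), withSlash(pend)],
    },
    {
      name: "sync",
      command: "rsync",
      args: ["-a", "--delete", "--itemize-changes", withSlash(source), withSlash(pend)],
    },
  ];
}
```

Only the second step's output needs to be captured, since it is the one that describes what changed relative to PREV.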
## Safety / State Machine

Sequence numbers (not timestamps) identify each delta. A `state.json` in DELTAS tracks progress:

```json
{ "next_seq": 5, "last_complete": 4 }
```

Phase transitions are written to `state.json` so an aborted run can be detected and recovered.

**Atomic commit strategy:**
1. Write delta files to `DELTAS/tmp/N/`
2. Rename `DELTAS/tmp/N/` → `DELTAS/N/` (atomic on the same filesystem)
3. Promote PEND → PREV
4. Update `state.json`

The presence of a fully renamed `DELTAS/N/` directory is the canonical "delta committed" marker.
`state.json` is a recoverable cache: it can be reconstructed by scanning DELTAS.

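The first two steps of the commit strategy (write to `DELTAS/tmp/N/`, then rename) might look roughly like this. It is a sketch under assumptions: `commitDelta` and its `writeFiles` callback are hypothetical names, and synchronous `fs` calls are used only to keep the example short.

```js
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: stage a delta under DELTAS/tmp/N/, then rename it
// into place. The rename is the commit point.
function commitDelta(deltasDir, seq, writeFiles) {
  const tmpDir = path.join(deltasDir, "tmp", String(seq));
  const finalDir = path.join(deltasDir, String(seq));

  // Discard leftovers from an earlier aborted run of the same sequence.
  fs.rmSync(tmpDir, { recursive: true, force: true });
  fs.mkdirSync(tmpDir, { recursive: true });

  writeFiles(tmpDir); // caller writes manifest.json and files/ here

  // Fails if DELTAS/N/ already exists with content, which protects a
  // previously committed delta from being overwritten.
  fs.renameSync(tmpDir, finalDir);
  return finalDir;
}
```

Because `DELTAS/tmp/` lives inside DELTAS, both paths are on the same filesystem and the rename does not degrade into a copy.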
**Recovery rules:**
- `DELTAS/N/` exists but `last_complete` is N-1 → finish promotion, update state
- `state.json` missing → reconstruct from directory scan

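The two recovery rules can be written as a pure decision function over the parsed `state.json` (or `null` if it is missing) and the entry names found in DELTAS. This is an illustrative sketch: `reconstructState` and `recoveryAction` are hypothetical names, and it assumes sequence numbers start at 1, so an empty DELTAS yields `last_complete: 0`.

```js
// Hypothetical helper: rebuild state from the names of entries in DELTAS.
// Non-numeric entries such as "tmp" and "state.json" are ignored.
function reconstructState(entries) {
  const seqs = entries.filter((name) => /^\d+$/.test(name)).map(Number);
  const last = seqs.length > 0 ? Math.max(...seqs) : 0; // assumption: first seq is 1
  return { next_seq: last + 1, last_complete: last };
}

// Hypothetical helper: decide which recovery rule applies, if any.
function recoveryAction(state, entries) {
  const scanned = reconstructState(entries);
  if (state === null) {
    return { action: "reconstruct", state: scanned };
  }
  if (scanned.last_complete === state.last_complete + 1) {
    // DELTAS/N/ exists but last_complete is still N-1.
    return { action: "finish-promotion", seq: scanned.last_complete };
  }
  return { action: "none" };
}
```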
## Change Detection

No directory walk is needed. rsync SOURCE→PEND is run with `--itemize-changes`, producing a
machine-readable list of exactly what changed. Output is captured (not streamed) and parsed:

| rsync prefix | Meaning |
|-------------|----------|
| `>f+++++++++` | New file |
| `>f.st......` | Modified file (any combination of change flags) |
| `*deleting` | Deleted file |
| `cd+++++++++` | New directory (ignored for delta purposes) |

Lines starting with `>f` or `*deleting` are extracted. The path is the remainder after the
11-character itemize code and the space that follows it. This becomes the change list fed directly
into delta generation, with no separate directory walk.

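The parsing rule above fits in a few lines. The sketch below assumes the captured rsync output is available as a single string; `parseItemize` is a hypothetical name, and the `added` / `modified` / `deleted` status values are borrowed from the manifest format.

```js
// Hypothetical helper: turn captured --itemize-changes output into a change list.
function parseItemize(output) {
  const changes = [];
  for (const line of output.split("\n")) {
    if (line.length < 13) continue; // 11-char code + space + at least 1 path char
    const code = line.slice(0, 11);
    const path = line.slice(12);
    if (code.startsWith("*deleting")) {
      changes.push({ path, status: "deleted" });
    } else if (code.startsWith(">f")) {
      // All-plus flags mean the file did not exist on the receiving side.
      const status = code.slice(2) === "+++++++++" ? "added" : "modified";
      changes.push({ path, status });
    }
    // Everything else (e.g. "cd+++++++++" directories) is ignored.
  }
  return changes;
}
```

Slicing at a fixed offset rather than splitting on whitespace keeps paths that contain spaces intact.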
## Delta Format

Pluggable backend interface with two operations:

```js
backend.createDelta(prevFile, newFile, outFile)   // spawn process, no shell strings
backend.applyDelta(prevFile, deltaFile, outFile)  // spawn process, no shell strings
```

**Default backend: zstd**
- Modified files: `zstd --patch-from=prev new -o out.zst`
- New files: `zstd new -o out.zst` (no base)
- Deleted files: manifest entry only, no delta file

**Planned backends:** xdelta3, bsdiff

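For the zstd backend, the commands listed above translate into argument arrays as sketched below. The function names are hypothetical, and passing `null` as `prevFile` to mean "new file, no base" is an assumption made for this example; the apply side uses zstd's `-d` (decompress) flag together with the same `--patch-from` base.

```js
// Hypothetical helper: argv for creating a delta (or a plain compressed
// blob when there is no previous version).
function zstdCreateArgs(prevFile, newFile, outFile) {
  const args = [];
  if (prevFile !== null) args.push(`--patch-from=${prevFile}`);
  args.push(newFile, "-o", outFile);
  return args;
}

// Hypothetical helper: argv for applying a delta back onto its base.
function zstdApplyArgs(prevFile, deltaFile, outFile) {
  const args = ["-d"];
  if (prevFile !== null) args.push(`--patch-from=${prevFile}`);
  args.push(deltaFile, "-o", outFile);
  return args;
}
```

Each array is meant to be passed to a spawn call as-is, so file names containing spaces or shell metacharacters need no quoting.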
## Manifest Format

Each delta `DELTAS/N/` contains:
- `manifest.json`: lists all changed files with their status (added/modified/deleted) and metadata
- `files/`: per-file deltas or compressed blobs, named by numeric index (`0.zst`, `1.zst`, ...)

```json
{
  "seq": 5,
  "timestamp": "2026-03-07T12:00:00Z",
  "prev_seq": 4,
  "backend": "zstd",
  "changes": [
    { "path": "src/main.js", "status": "modified", "delta": "files/0.zst" },
    { "path": "assets/logo.png", "status": "added", "delta": "files/1.zst" },
    { "path": "old/thing.txt", "status": "deleted" }
  ]
}
```

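Assembling a manifest from a change list could look like the following. This is a sketch only: `buildManifest` is a hypothetical name, and assigning indices in change-list order (skipping deleted entries, which have no delta file) is an assumption about how the numeric names are handed out.

```js
// Hypothetical helper: build the manifest object for one delta.
// Index-based delta names are assigned only to entries that have a file.
function buildManifest({ seq, prevSeq, backend, timestamp, changes }) {
  let index = 0;
  return {
    seq,
    timestamp,
    prev_seq: prevSeq,
    backend,
    changes: changes.map(({ path, status }) =>
      status === "deleted"
        ? { path, status }
        : { path, status, delta: `files/${index++}.zst` }
    ),
  };
}
```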
## CLI Interface

```
delta-backup [options] <command>

Commands:
  run       Full backup run
  status    Show current state (sequences, last run, pending recovery)
  restore   Apply deltas to reconstruct a point in time (future)

Options:
  --source <path>    SOURCE directory (required)
  --prev <path>      PREV directory (required)
  --pend <path>      PEND directory (required)
  --deltas <path>    DELTAS directory (required)
  --backend <name>   Delta backend: zstd (default), xdelta3
  --dry-run          Print what would happen, execute nothing
  --config <file>    Load options from JSON config file (flags override)
```

Guards: refuse to run if any required path is missing from both the arguments and the config.
Never fall back to CWD or implicit defaults for directories; explicit is safer.

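The guard and the "flags override config" rule can be combined in one small function. The sketch below assumes flags and config have already been parsed into plain objects keyed by option name; `resolveOptions` and `REQUIRED_PATHS` are hypothetical names.

```js
const REQUIRED_PATHS = ["source", "prev", "pend", "deltas"];

// Hypothetical helper: merge config and flags, then enforce the guard.
function resolveOptions(flags, config = {}) {
  // Drop flags that were not given so they cannot mask config values.
  const given = Object.fromEntries(
    Object.entries(flags).filter(([, value]) => value !== undefined)
  );
  const merged = { backend: "zstd", ...config, ...given };

  const missing = REQUIRED_PATHS.filter(
    (key) => typeof merged[key] !== "string" || merged[key] === ""
  );
  if (missing.length > 0) {
    const names = missing.map((key) => `--${key}`).join(", ");
    throw new Error(`Missing required path(s): ${names}`);
  }
  return merged;
}
```

Only `backend` gets a default here; the four directories never do, in line with the guard.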
## Process Spawning

All external tools (rsync, zstd, xdelta3) are spawned with explicit argument arrays.
No shell string interpolation, ever. Use Node's `child_process.spawn` or similar.

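A minimal wrapper in this spirit, shown here with `child_process.spawnSync` for brevity, might look like the sketch below. `runTool` is a hypothetical name, and throwing on any non-zero exit code is a simplifying assumption.

```js
const { spawnSync } = require("node:child_process");

// Hypothetical helper: run a tool with an explicit argv and capture stdout.
// No shell is involved, so arguments are passed through byte-for-byte.
function runTool(command, args) {
  const result = spawnSync(command, args, { shell: false, encoding: "utf8" });
  if (result.error) throw result.error; // e.g. command not found
  if (result.status !== 0) {
    throw new Error(`${command} exited with code ${result.status}`);
  }
  return result.stdout;
}
```

An argument such as `$(echo hi); ls` reaches the child process as literal text. Note that `spawnSync` buffers the whole output in memory and enforces a `maxBuffer` limit, so a very large itemize listing may need a larger limit or the streaming `spawn` API.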
### Planned: Operation Abstractions

Currently dry-run logic is scattered inline throughout the run command. The intent is to refactor
toward self-describing operation objects: each operation knows both how to describe itself (for
dry-run) and how to execute itself. This makes the run command a clean sequence of operations,
makes per-tool behavior easy to adjust (e.g. rsync exit code handling), and makes dry-run output
a natural consequence of the abstraction rather than duplicated conditional logic.

Sketch:
```js
// Each tool gets its own operation type
const op = rsyncOp({ args: [...], allowedExitCodes: [0, 24] });
op.describe();   // prints what it would do
await op.run();  // executes

// Run command becomes:
const ops = buildOps(config);
if (dry) ops.forEach(op => op.describe());
else for (const op of ops) await op.run();
```

Per-tool exit code handling (e.g. rsync's partial transfer codes) lives inside the operation,
not scattered across callers.

### Current: rsync Exit Code Handling

Meaningful rsync exit codes:
- `0`: success
- `23`: partial transfer due to error (fatal)
- `24`: partial transfer due to vanished source files (acceptable in some cases)

Current handling is basic: any non-zero exit code throws. Finer-grained handling is planned as
part of the operation abstraction refactor.

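The finer-grained handling could start from a classification like the one below. This is a sketch of one possible policy, not the planned design: `classifyRsyncExit` and the `tolerateVanished` option are hypothetical names.

```js
// Hypothetical helper: map an rsync exit code to "ok", "warning", or "fatal".
function classifyRsyncExit(code, { tolerateVanished = false } = {}) {
  if (code === 0) return "ok";
  // 24: source files vanished mid-transfer; acceptable only when opted in.
  if (code === 24 && tolerateVanished) return "warning";
  // 23 (partial transfer due to error) and everything else is fatal.
  return "fatal";
}
```

With the default options this reproduces the current behavior, where every non-zero code is fatal.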
## Known Limitations

### Delta file naming
Delta files are named by numeric index (e.g. `0.zst`, `1.zst`) rather than by path. The manifest
maps each index to its source path. Path-based naming was considered but rejected because:
- Deep directory trees can exceed filesystem filename length limits
- Path separator substitution (e.g. `/` → `__`) is ambiguous for filenames containing that sequence

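The ambiguity is easy to reproduce. In the sketch below, `flatten` is a hypothetical stand-in for the rejected path-based scheme, and `indexName` for the adopted one.

```js
// Rejected scheme (hypothetical stand-in): replace "/" with "__".
const flatten = (p) => p.split("/").join("__");

// Adopted scheme: the name depends only on position in the manifest.
const indexName = (i) => `${i}.zst`;

// Two distinct source paths that collide after flattening.
const a = flatten("a/b__c");
const b = flatten("a__b/c");
console.log(a === b); // true: both become "a__b__c"
```

Index-based names cannot collide, and their length does not depend on directory depth.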
### Cross-file deduplication
Per-file deltas cannot exploit similarity between different files: each file is compressed or
diffed in isolation. Identical or near-identical files in different locations get no benefit from
each other. Approaches that could address this:
- `zstd --train` to build a shared dictionary from the corpus, then compress all deltas against it
- Content-addressed storage (deduplicate at the block or file level before delta generation)
- Tar the entire PEND tree and delta against the previous tar (single-stream, so cross-file
  repetition is visible to the compressor, but random access for restore becomes harder)

These are significant complexity increases and out of scope for now.

## Occasional Snapshots

Delta chains are space-efficient but fragile as they grow: restoring a point in time requires
every delta back to the last full copy, so one damaged delta breaks everything after it. Periodic
full snapshots (every N deltas, or on demand) bound that blast radius. Snapshot support is planned
but not in scope for the initial implementation.

## Implementation Phases

1. **Phase 1 (now):** Arg parsing, config, dry-run, guards, rsync steps
2. **Phase 2:** Delta generation with zstd backend, manifest writing, atomic commit
3. **Phase 3:** PREV promotion, state.json management, recovery logic
4. **Phase 4:** `status` and `restore` commands
5. **Future:** Additional backends, snapshot support, scheduling