Compare commits

...

10 Commits

Author SHA1 Message Date
8d1d1241b6 Add file attribute tracking TODO to PLAN.md
Document planned approach for capturing and restoring file metadata
(permissions, mtime, uid/gid) alongside content deltas. Notes the need
for a pluggable attribute backend due to filesystem differences, and the
requirement to fail loudly on incompatible attributes during restore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 02:07:09 +00:00
f8829af7a8 Bundle per-file deltas into delta.tar.zst instead of loose files 2026-03-07 01:56:00 +00:00
ba67366cd6 Use numeric indices for delta filenames, document limitations
- Delta files now named 0.zst, 1.zst etc — avoids path length issues
  and ambiguous separator substitution; manifest maps index to path
- PLAN.md: document delta naming rationale
- PLAN.md: document cross-file deduplication limitation and possible
  future approaches (zstd dictionary training, content-addressing, tar stream)
2026-03-07 01:47:31 +00:00
f1faa992c9 Fix missing rename import in run.js 2026-03-07 01:42:58 +00:00
45924cbcd7 Add rsync exit code awareness + plan operation abstraction
- spawn.js: rsync() wrapper handles exit codes 0/24 as OK, 23 as fatal
- spawn.js: capture() accepts allowedExitCodes option
- run.js: all rsync calls go through rsync() wrapper
- PLAN.md: document planned operation abstraction refactor
2026-03-07 01:41:25 +00:00
ab7479e62d Phase 6: replace rm+rename with rsync --delete pend/ prev/ 2026-03-07 01:36:21 +00:00
d1c65a06d5 Replace rm -rf PEND with mkdir -p, use --delete on both rsyncs 2026-03-07 01:32:33 +00:00
e999fca352 Add --base meta-argument to set prev/pend/deltas as subdirs 2026-03-07 01:30:06 +00:00
96e3024991 Implement full run pipeline
- Phase 3: rsync with --itemize-changes captured, parsed into change list
- Phase 4: per-file zstd deltas written to DELTAS/tmp/N/files/
- Phase 5: manifest.json written, atomic rename tmp/N → N
- Phase 6: PEND promoted to PREV via rm+rename
- Dry-run prints all steps without executing
2026-03-07 01:08:30 +00:00
30b90193d7 Add capture() to spawn, add rsync itemize parser 2026-03-07 01:07:51 +00:00
6 changed files with 315 additions and 29 deletions

PLAN.md

@@ -127,6 +127,81 @@ CWD or implicit defaults for directories — explicit is safer.
 All external tools (rsync, zstd, xdelta3) are spawned with explicit argument arrays.
 No shell string interpolation ever. Use Node's `child_process.spawn` or similar.
+
+### Planned: Operation Abstractions
+
+Currently dry-run logic is scattered inline throughout the run command. The intent is to refactor
+toward self-describing operation objects — each operation knows both how to describe itself (for
+dry-run) and how to execute itself. This makes the run command a clean sequence of operations,
+makes per-tool behavior easy to adjust (e.g. rsync exit code handling), and makes dry-run output
+a natural consequence of the abstraction rather than duplicated conditional logic.
+
+Sketch:
+
+```js
+// Each tool gets its own operation type
+const op = rsyncOp({ args: [...], allowedExitCodes: [0, 24] });
+op.describe(); // prints what it would do
+await op.run(); // executes
+
+// Run command becomes:
+const ops = buildOps(config);
+if (dry) ops.forEach(op => op.describe());
+else for (const op of ops) await op.run();
+```
+
+Per-tool exit code handling (e.g. rsync's partial transfer codes) lives inside the operation,
+not scattered across callers.
+
+### Current: rsync Exit Code Handling
+
+Meaningful rsync exit codes:
+
+- `0` — success
+- `23` — partial transfer due to error (fatal)
+- `24` — partial transfer due to vanished source files (acceptable in some cases)
+
+Current handling is basic: any non-zero exit code throws. Finer-grained handling is planned as
+part of the operation abstraction refactor.
+
+## Known Limitations
+
+### Delta file naming
+
+Delta files are named by numeric index (e.g. `0.zst`, `1.zst`) rather than by path. The manifest
+maps each index to its source path. Path-based naming was considered but rejected because:
+
+- Deep directory trees can exceed filesystem filename length limits
+- Path separator substitution (e.g. `/` → `__`) is ambiguous for filenames containing that sequence
+
+### Cross-file deduplication
+
+Per-file deltas cannot exploit similarity between different files — each file is compressed/diffed
+in isolation. Identical or near-identical files in different locations get no benefit from each
+other. Approaches that could address this:
+
+- `zstd --train` to build a shared dictionary from the corpus, then compress all deltas against it
+- Content-addressed storage (deduplicate at the block or file level before delta generation)
+- Tar the entire PEND tree and delta against the previous tar (single-stream, cross-file repetition
+  is visible to the compressor — but random access for restore becomes harder)
+
+These are significant complexity increases and out of scope for now.
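As a concrete sketch of the first approach, the zstd CLI's built-in dictionary trainer can be exercised end to end. The corpus, directory, and filenames below are hypothetical demo data, not part of this pipeline:

```shell
# Sketch of the dictionary approach (demo corpus and paths are hypothetical).
set -e
mkdir -p /tmp/dict-demo
cd /tmp/dict-demo
# Build a small corpus of similar-but-not-identical files
for i in $(seq 1 40); do
  for j in $(seq 1 200); do echo "file $i shared-ish line $j"; done > "f$i.txt"
done
# Train a shared dictionary on the corpus
zstd --train f*.txt -o corpus.dict --maxdict=4096 -f
# Compress one file against the dictionary; decompression needs the same -D
zstd -D corpus.dict f1.txt -o f1.txt.zst -f
zstd -D corpus.dict -d f1.txt.zst -o f1.roundtrip -f
cmp f1.txt f1.roundtrip
```

Note the operational cost this hints at: every restore now depends on the dictionary file as well as the delta, which is part of why this is out of scope for now.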
+### File attribute tracking (TODO)
+
+Currently the manifest records only file content changes. File metadata (permissions, mtime,
+ownership, xattrs) is not tracked, meaning restore cannot faithfully reconstruct the original
+state.
+
+**Planned approach:**
+
+- On each run, compare attributes between PREV and PEND for every file in the change list
+- Encode attribute changes explicitly in the manifest alongside content changes
+- Restore walks the delta chain applying both content deltas and attribute deltas in sequence
+
+**Design considerations:**
+
+- `fs.stat()` gives mode, mtime, uid, gid — but not xattrs, ACLs, or fs-specific attributes
+- Attribute richness is highly filesystem-dependent (ext4, btrfs, APFS, NTFS all differ)
+- Need a pluggable attribute backend, similar to the delta backend, so the attribute set captured
+  and restored can be tuned per deployment without changing core logic
+- Restore must handle the case where an attribute from an older delta is no longer representable
+  on the target filesystem (e.g. restoring to a different fs type) — fail loudly rather than
+  silently skip
+- rsync `-a` already preserves attributes into PEND, so PEND is always the authoritative source
+  of truth for what attributes should be at that point in time
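The stat-based comparison step can be sketched in a few lines. The helpers below (`captureAttrs`, `diffAttrs`) are hypothetical illustrations, not part of the codebase, and deliberately show the `fs.stat()` limitation: only mode/mtime/uid/gid are visible here.

```javascript
import { statSync, writeFileSync, chmodSync, mkdtempSync } from 'fs';
import { join } from 'path';
import { tmpdir } from 'os';

// Reduce a stat result to the attributes a manifest entry might record.
// Note: xattrs and ACLs are invisible to fs.stat() — a pluggable backend
// would be needed to capture those.
function captureAttrs(path) {
  const s = statSync(path);
  return { mode: s.mode & 0o7777, mtimeMs: s.mtimeMs, uid: s.uid, gid: s.gid };
}

// Diff two attribute records; only changed keys would go into the manifest.
function diffAttrs(prevAttrs, pendAttrs) {
  const changed = {};
  for (const key of Object.keys(pendAttrs)) {
    if (prevAttrs[key] !== pendAttrs[key]) changed[key] = pendAttrs[key];
  }
  return changed;
}

// Demo: a chmod shows up as a mode-only attribute delta.
const dir = mkdtempSync(join(tmpdir(), 'attrs-'));
const file = join(dir, 'example');
writeFileSync(file, 'content');
chmodSync(file, 0o600);
const before = captureAttrs(file);
chmodSync(file, 0o644);
const delta = diffAttrs(before, captureAttrs(file));
console.log(delta.mode.toString(8)); // prints "644" — only the mode changed
```

In the real pipeline, `before` would come from stat'ing the PREV copy and the second capture from PEND, per file in the change list.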
 ## Occasional Snapshots
 Delta chains are efficient but fragile over long chains. Periodic full snapshots (every N deltas,


@@ -12,9 +12,10 @@ Commands:
 Options:
   --source <path>    SOURCE directory (required)
-  --prev <path>      PREV directory (required)
-  --pend <path>      PEND directory (required)
-  --deltas <path>    DELTAS directory (required)
+  --base <path>      Sets --prev, --pend, --deltas as subdirs of base path
+  --prev <path>      PREV directory (default: <base>/previous)
+  --pend <path>      PEND directory (default: <base>/pending)
+  --deltas <path>    DELTAS directory (default: <base>/deltas)
   --backend <name>   Delta backend: zstd (default), xdelta3
   --config <file>    Load options from JSON config file (flags override)
   --dry-run          Print what would happen, execute nothing
@@ -26,6 +27,7 @@ export function parseArgs(argv) {
     args: argv,
     options: {
       source: { type: 'string' },
+      base: { type: 'string' },
       prev: { type: 'string' },
       pend: { type: 'string' },
       deltas: { type: 'string' },


@@ -1,16 +1,16 @@
 /**
  * run command — full backup run.
  */
-import { rm, mkdir } from 'fs/promises';
+import { mkdir, rename, rm, writeFile } from 'fs/promises';
 import { join } from 'path';
-import { run as spawn } from '../spawn.js';
+import { run as spawn, rsync } from '../spawn.js';
+import { parseItemize } from '../itemize.js';
 import { getBackend } from '../backends/index.js';
 import { readState, writeState, PHASES } from '../state.js';
 export async function runCommand(config) {
-  const { source, prev, pend, deltas, backend: backendName, dryRun } = config;
+  const { source, prev, pend, deltas, backend: backendName, dryRun: dry } = config;
   const backend = getBackend(backendName);
-  const dry = dryRun;
   if (dry) console.log('[dry-run] No changes will be made.\n');
@@ -20,42 +20,130 @@ export async function runCommand(config) {
   console.log(`Starting run — seq ${seq} (last complete: ${state.last_complete})`);
-  // TODO: detect and handle partially-committed previous run
+  // TODO: detect and recover from partially-committed previous run
-  // ── Phase 1: Clear PEND ─────────────────────────────────────
+  // ── Phase 1: Ensure PEND exists ─────────────────────────────
   await setPhase(deltas, state, PHASES.CLEARING_PEND, dry);
-  console.log('\n── Clear PEND ──');
   if (!dry) {
-    await rm(pend, { recursive: true, force: true });
     await mkdir(pend, { recursive: true });
   } else {
-    console.log(`[dry-run] rm -rf ${pend} && mkdir -p ${pend}`);
+    console.log(`[dry-run] mkdir -p ${pend}`);
   }
-  // ── Phase 2: rsync PREV → PEND (local seed) ─────────────────
+  // ── Phase 2: rsync PREV → PEND (local seed, with delete) ────
   await setPhase(deltas, state, PHASES.RSYNC_LOCAL, dry);
   console.log('\n── rsync PREV → PEND (local seed) ──');
-  await spawn('rsync', ['-aP', trailingSlash(prev), pend], { dryRun: dry });
+  await rsync(['-aP', '--delete', trailingSlash(prev), trailingSlash(pend)], { dryRun: dry });
-  // ── Phase 3: rsync SOURCE → PEND (remote changes) ───────────
+  // ── Phase 3: rsync SOURCE → PEND, capture change list ───────
   await setPhase(deltas, state, PHASES.RSYNC_REMOTE, dry);
   console.log('\n── rsync SOURCE → PEND ──');
-  await spawn('rsync', ['-aP', trailingSlash(source), pend], { dryRun: dry });
-  // ── Phase 4: Generate delta ──────────────────────────────────
+  const output = await rsync(
+    ['-aP', '--itemize-changes', '--delete', trailingSlash(source), trailingSlash(pend)],
+    { dryRun: dry, capture: true },
+  );
+  const changes = dry ? [] : parseItemize(output);
+  if (!dry) {
+    console.log(`  ${changes.length} file(s) changed`);
+    for (const c of changes) console.log(`  [${c.status}] ${c.path}`);
+  } else {
+    console.log('  [dry-run] change list determined at runtime');
+  }
+  // ── Phase 4: Generate per-file deltas into DELTAS/tmp/N/files/
   await setPhase(deltas, state, PHASES.GENERATING, dry);
   console.log('\n── Generate delta ──');
-  // TODO: walk PREV and PEND, diff per file, build manifest
-  // ── Phase 5: Commit delta ────────────────────────────────────
+  const tmpDir = join(deltas, 'tmp', String(seq));
+  const filesDir = join(tmpDir, 'files');
+  const tarFile = join(tmpDir, 'delta.tar');
+  const bundleFile = join(tmpDir, 'delta.tar.zst');
+  if (!dry) {
+    await mkdir(filesDir, { recursive: true });
+  } else {
+    console.log(`[dry-run] mkdir -p ${filesDir}`);
+  }
+  const manifestChanges = [];
+  let fileIndex = 0;
+  for (const change of changes) {
+    if (change.status === 'deleted') {
+      manifestChanges.push({ path: change.path, status: 'deleted' });
+      continue;
+    }
+    const deltaFilename = `${fileIndex}${backend.ext}`;
+    const outFile = join(filesDir, deltaFilename);
+    const prevFile = join(prev, change.path);
+    const newFile = join(pend, change.path);
+    console.log(`  [${change.status}] ${change.path}`);
+    if (!dry) {
+      await backend.createDelta(
+        change.status === 'modified' ? prevFile : null,
+        newFile,
+        outFile,
+      );
+    } else {
+      console.log(`[dry-run] ${change.status === 'modified'
+        ? `zstd --patch-from ${prevFile} ${newFile} -o ${outFile}`
+        : `zstd ${newFile} -o ${outFile}`}`);
+    }
+    manifestChanges.push({
+      path: change.path,
+      status: change.status,
+      delta: deltaFilename,
+    });
+    fileIndex++;
+  }
+  // ── Bundle: tar files/ → delta.tar → delta.tar.zst ──────────
+  console.log('\n── Bundle deltas ──');
+  // tar with -C so paths inside the archive are relative (just filenames)
+  await spawn('tar', ['cf', tarFile, '-C', filesDir, '.'], { dryRun: dry });
+  await spawn('zstd', [tarFile, '-o', bundleFile, '-f'], { dryRun: dry });
+  if (!dry) {
+    await rm(filesDir, { recursive: true });
+    await rm(tarFile);
+  } else {
+    console.log(`[dry-run] rm -rf ${filesDir} ${tarFile}`);
+  }
+  // ── Phase 5: Write manifest + atomic commit ──────────────────
   await setPhase(deltas, state, PHASES.COMMITTING, dry);
   console.log('\n── Commit delta ──');
-  // TODO: atomic rename DELTAS/tmp/N → DELTAS/N
+  const manifest = {
+    seq,
+    timestamp: new Date().toISOString(),
+    prev_seq: state.last_complete,
+    backend: backendName,
+    bundle: 'delta.tar.zst',
+    changes: manifestChanges,
+  };
+  const seqDir = join(deltas, String(seq));
+  if (!dry) {
+    await writeFile(join(tmpDir, 'manifest.json'), JSON.stringify(manifest, null, 2) + '\n');
+    // Atomic rename: tmp/N → N
+    await rename(tmpDir, seqDir);
+    console.log(`  Committed to ${seqDir}`);
+  } else {
+    console.log(`[dry-run] write manifest to ${tmpDir}/manifest.json`);
+    console.log(`[dry-run] rename ${tmpDir} → ${seqDir}`);
+  }
   // ── Phase 6: Promote PEND → PREV ────────────────────────────
   await setPhase(deltas, state, PHASES.PROMOTING, dry);
   console.log('\n── Promote PEND → PREV ──');
-  // TODO: mv PEND PREV (swap)
+  await rsync(['-aP', '--delete', trailingSlash(pend), trailingSlash(prev)], { dryRun: dry });
   // ── Done ─────────────────────────────────────────────────────
   state.last_complete = seq;
@@ -63,7 +151,7 @@ export async function runCommand(config) {
   state.phase = PHASES.IDLE;
   if (!dry) await writeState(deltas, state);
-  console.log(`\nRun complete — seq ${seq} committed.`);
+  console.log(`\nRun complete — seq ${seq} committed. ${manifestChanges.length} file(s) in delta.`);
 }
 async function setPhase(deltas, state, phase, dry) {


@@ -3,6 +3,7 @@
  * CLI args always win. Required paths are validated here.
  */
 import { readFile } from 'fs/promises';
+import { join } from 'path';
 const REQUIRED_PATHS = ['source', 'prev', 'pend', 'deltas'];
 const DEFAULTS = {
@@ -25,6 +26,13 @@ export async function loadConfig(args) {
   // CLI args override file config, file config overrides defaults
   const config = { ...DEFAULTS, ...fileConfig, ...filterDefined(args) };
+  // Expand --base into --prev/--pend/--deltas, explicit flags take priority
+  if (config.base) {
+    config.prev ??= join(config.base, 'previous');
+    config.pend ??= join(config.base, 'pending');
+    config.deltas ??= join(config.base, 'deltas');
+  }
   // Guard: refuse to run if any required path is missing
   if (config.command === 'run') {
     const missing = REQUIRED_PATHS.filter(k => !config[k]);
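The `??=` expansion above is worth a standalone illustration, since the precedence (explicit flag beats `--base`) is the whole point. The `expandBase` helper here is hypothetical; in the real code this happens inline in `loadConfig`:

```javascript
import { join } from 'path';

// Hypothetical standalone version of the --base expansion above.
function expandBase(config) {
  if (config.base) {
    // ??= assigns only when the left side is null/undefined,
    // so explicit --prev/--pend/--deltas flags win over --base.
    config.prev ??= join(config.base, 'previous');
    config.pend ??= join(config.base, 'pending');
    config.deltas ??= join(config.base, 'deltas');
  }
  return config;
}

const a = expandBase({ base: '/srv/backup' });
const b = expandBase({ base: '/srv/backup', prev: '/elsewhere/prev' });
console.log(a.prev); // /srv/backup/previous
console.log(b.prev); // /elsewhere/prev — explicit flag wins
```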

lib/itemize.js (new file)

@@ -0,0 +1,56 @@
+/**
+ * Parse rsync --itemize-changes output into a structured change list.
+ *
+ * rsync itemize format: 11-character code + space + path
+ *
+ * Code structure: YXcstpoguax
+ * Y = update type: > (transfer), * (message/delete), c (local change), . (no update), h (hard link)
+ * X = file type: f (file), d (dir), L (symlink), D (device), S (special)
+ * remaining chars = what changed (size, time, perms, etc.) or '+++++++++' for new
+ *
+ * We care about:
+ * >f... = file transferred (new or modified)
+ * *deleting = file deleted
+ * cd... = directory (ignored for delta purposes)
+ */
+/**
+ * @typedef {{ status: 'added'|'modified'|'deleted', path: string }} Change
+ */
+/**
+ * Parse rsync --itemize-changes stdout into a list of file changes.
+ * @param {string} output
+ * @returns {Change[]}
+ */
+export function parseItemize(output) {
+  const changes = [];
+  for (const raw of output.split('\n')) {
+    const line = raw.trimEnd();
+    if (!line) continue;
+    // Deleted files: "*deleting path/to/file"
+    if (line.startsWith('*deleting ')) {
+      const path = line.slice('*deleting '.length).trimStart();
+      // Skip directory deletions (trailing slash)
+      if (!path.endsWith('/')) {
+        changes.push({ status: 'deleted', path });
+      }
+      continue;
+    }
+    // File transfers: ">f......... path" (new or modified)
+    if (line.length > 12 && line[0] === '>' && line[1] === 'f') {
+      const code = line.slice(0, 11);
+      const path = line.slice(12);
+      const isNew = code.slice(2) === '+++++++++';
+      changes.push({ status: isNew ? 'added' : 'modified', path });
+      continue;
+    }
+    // Everything else (dirs, symlinks, attribute-only changes) — ignore
+  }
+  return changes;
+}
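A worked example makes the parser's behavior concrete. The sample itemize lines below are invented, and the function body is inlined (mirroring `parseItemize` above) so the snippet runs standalone:

```javascript
// Inlined mirror of the parseItemize logic above, for a self-contained demo.
function parseItemize(output) {
  const changes = [];
  for (const raw of output.split('\n')) {
    const line = raw.trimEnd();
    if (!line) continue;
    if (line.startsWith('*deleting ')) {
      const path = line.slice('*deleting '.length).trimStart();
      if (!path.endsWith('/')) changes.push({ status: 'deleted', path });
      continue;
    }
    if (line.length > 12 && line[0] === '>' && line[1] === 'f') {
      const code = line.slice(0, 11);
      changes.push({
        status: code.slice(2) === '+++++++++' ? 'added' : 'modified',
        path: line.slice(12),
      });
    }
  }
  return changes;
}

// Invented sample lines in rsync's itemize format:
const sample = [
  '>f+++++++++ docs/new.md',    // new file: trailing nine chars all '+'
  '>f.st...... src/changed.js', // size and time changed: modified
  'cd+++++++++ newdir/',        // directory creation: ignored
  '*deleting   old/gone.txt',   // deletion message
  '*deleting   old/dir/',       // directory deletion: ignored
].join('\n');

console.log(parseItemize(sample));
// [ { status: 'added', path: 'docs/new.md' },
//   { status: 'modified', path: 'src/changed.js' },
//   { status: 'deleted', path: 'old/gone.txt' } ]
```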


@@ -4,17 +4,14 @@
 import { spawn } from 'child_process';
 /**
- * Spawn a process and stream its output.
+ * Spawn a process and stream its output to stdout/stderr.
  * @param {string} cmd
  * @param {string[]} args
- * @param {{ dryRun?: boolean, label?: string }} opts
+ * @param {{ dryRun?: boolean }} opts
  * @returns {Promise<void>}
  */
-export async function run(cmd, args, { dryRun = false, label } = {}) {
-  const display = [cmd, ...args].join(' ');
-  if (label) console.log(`[${label}] ${display}`);
-  else console.log(`$ ${display}`);
+export async function run(cmd, args, { dryRun = false } = {}) {
+  console.log(`$ ${[cmd, ...args].join(' ')}`);
   if (dryRun) return;
   return new Promise((resolve, reject) => {
@@ -26,3 +23,63 @@ export async function run(cmd, args, { dryRun = false } = {}) {
     });
   });
 }
+/**
+ * Spawn a process and capture stdout as a string.
+ * stderr is inherited (shown to user). Never used in dry-run context.
+ * @param {string} cmd
+ * @param {string[]} args
+ * @param {{ allowedExitCodes?: number[] }} opts
+ * @returns {Promise<string>}
+ */
+export async function capture(cmd, args, { allowedExitCodes = [0] } = {}) {
+  console.log(`$ ${[cmd, ...args].join(' ')}`);
+  return new Promise((resolve, reject) => {
+    const child = spawn(cmd, args, { stdio: ['inherit', 'pipe', 'inherit'] });
+    const chunks = [];
+    child.stdout.on('data', chunk => chunks.push(chunk));
+    child.on('error', reject);
+    child.on('close', code => {
+      if (allowedExitCodes.includes(code)) resolve(Buffer.concat(chunks).toString('utf8'));
+      else reject(new Error(`${cmd} exited with code ${code}`));
+    });
+  });
+}
+// rsync exit codes that are not errors
+const RSYNC_OK_CODES = [
+  0,  // success
+  24, // partial transfer: source files vanished mid-run (acceptable)
+];
+const RSYNC_ERROR_CODES = {
+  23: 'partial transfer due to error',
+};
+/**
+ * Run rsync with exit code awareness.
+ * @param {string[]} args
+ * @param {{ dryRun?: boolean, capture?: boolean }} opts
+ * @returns {Promise<void | string>}
+ */
+export async function rsync(args, { dryRun = false, capture: doCapture = false } = {}) {
+  console.log(`$ rsync ${args.join(' ')}`);
+  if (dryRun) return doCapture ? '' : undefined;
+  return new Promise((resolve, reject) => {
+    const stdio = doCapture ? ['inherit', 'pipe', 'inherit'] : 'inherit';
+    const child = spawn('rsync', args, { stdio });
+    const chunks = [];
+    if (doCapture) child.stdout.on('data', chunk => chunks.push(chunk));
+    child.on('error', reject);
+    child.on('close', code => {
+      if (RSYNC_OK_CODES.includes(code)) {
+        resolve(doCapture ? Buffer.concat(chunks).toString('utf8') : undefined);
+      } else {
+        const reason = RSYNC_ERROR_CODES[code] ?? 'unknown error';
+        reject(new Error(`rsync exited with code ${code}: ${reason}`));
+      }
+    });
+  });
+}