docs: add discovery flow diagrams, document restart detection limitation

- Four Mermaid sequence diagrams: startup, steady-state keepalive, node
  loss/timeout, and node restart
- Explicitly document that the site_id-change restart heuristic does not work
  in practice (site_id is static config, not a runtime value)
- Describe what needs to change: a boot nonce (random u32 at startup)
- Add boot nonce as a deferred item in planning.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-03-30 04:39:52 +00:00
parent f3a6be0701
commit 92ba1adf29
2 changed files with 72 additions and 7 deletions

View File

@@ -136,4 +136,5 @@ These are open questions tracked in `architecture.md` that do not need to be res
- controller_cli is a temporary dev tool; the long-term replacement is a dedicated `controller` binary outside `dev/cli/` that maintains simultaneous connections to all discovered nodes (not switching between them). Commands address a specific node by peer index. This mirrors the web UI's model of administering the whole network rather than one node at a time. The `connect` / active-connection model in the current controller_cli is an interim design choice that should not be carried forward.
- start-ingest peer addressing: the `dest_host` + `dest_port` in START_INGEST is awkward to type manually and requires the caller to know the target's TCP port. Should accept a peer ID (index from the discovered peer table on the node) so the node can resolve the address itself. Requires the node to run discovery and expose its peer table.
- Connection multiplexing: currently each ingest stream opens its own outbound TCP connection to the destination. Multiple streams between the same two peers should share one connection, with stream_id used to demultiplex frames. This is the priority/encapsulation scheme described in the architecture — high-priority and low-latency frames from different streams travel over the same socket rather than competing across separate sockets.
- Discovery boot nonce: the announcement payload needs a `boot_nonce` field (random u32 generated at startup, not configured). The current restart detection uses `site_id` change as a proxy, but `site_id` is static config and does not change on restart, so restarts are not detected and the restarted node waits up to `interval_ms` for peers to reply. Adding a boot nonce gives a reliable restart signal: a nonce change for a known (addr, port) entry triggers an immediate unicast reply. Requires a wire format version bump, peer table struct update, and changes to the announcement builder and receive logic.
- Control grouping: controls should be organizable into named groups for both display organisation (collapsible sections in a UI) and protocol semantics (enumerate controls within a group, set a group of related controls atomically). Relevant for display devices where scale_mode, anchor, position, and size are logically related, and for cameras where white balance, exposure, and gain belong together. The current flat list of (control_id, name, type, value) tuples does not capture this.