# Node Discovery and Multi-Site

See [Architecture Overview](../architecture.md).

## Node Discovery

Standard mDNS (RFC 6762) uses UDP multicast over `224.0.0.251:5353` with DNS-SD service records. The wire protocol is well-defined and the multicast group is already in active use on most LANs. The standard service discovery stack (Avahi, Bonjour, `nss-mdns`) provides that transport but brings significant overhead: persistent daemons, D-Bus dependencies, a complex configuration surface, and substantial resident memory. None of that is needed here.

The approach: **reuse the multicast transport, define our own wire format**.

Rather than the DNS wire format, node announcements are encoded as binary frames using the same serialization layer (`serial`) and frame header used for video transport. A node joins the multicast group, broadcasts periodic announcements, and listens for announcements from peers.

### Announcement Frame

| Field | Size | Purpose |
|---|---|---|
| `message_type` | 2 bytes | Discovery message type (e.g. `0x0010` for node announcement) |
| `channel_id` | 2 bytes | Reserved / zero |
| `payload_length` | 4 bytes | Byte length of payload |
| Payload | variable | Encoded node identity and capabilities |

Payload fields:

| Field | Type | Purpose |
|---|---|---|
| `protocol_version` | u8 | Wire format version |
| `site_id` | u16 | Site this node belongs to (`0` = local / unassigned) |
| `tcp_port` | u16 | Port where this node accepts transport connections |
| `function_flags` | u16 | Bitfield declaring node capabilities (see below) |
| `name_len` | u8 | Length of name string |
| `name` | bytes | Node name (`namespace:instance`, e.g. `v4l2:microscope`) |

`function_flags` bits:

| Bit | Mask | Meaning |
|---|---|---|
| 0 | `0x0001` | Source — produces video |
| 1 | `0x0002` | Relay — receives and distributes streams |
| 2 | `0x0004` | Sink — consumes video (display, archiver, etc.) |
| 3 | `0x0008` | Controller — participates in control plane coordination |

A node may set multiple bits — a relay that also archives sets both `RELAY` and `SINK`.
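
The frame payload and flag bits above can be sketched as a plain byte-level encoder. This is an illustrative sketch, not the project's `serial` layer: field order follows the tables, and big-endian byte order is an assumption made here for concreteness.

```rust
// Capability bits from the function_flags table.
const FLAG_SOURCE: u16 = 0x0001;
const FLAG_RELAY: u16 = 0x0002;
const FLAG_SINK: u16 = 0x0004;
const FLAG_CONTROLLER: u16 = 0x0008;

// Encode the announcement payload: version, site_id, tcp_port,
// function_flags, name_len, name. Byte order is assumed big-endian.
fn encode_payload(protocol_version: u8, site_id: u16, tcp_port: u16,
                  function_flags: u16, name: &str) -> Vec<u8> {
    assert!(name.len() <= u8::MAX as usize); // name_len is a u8
    let mut buf = Vec::with_capacity(8 + name.len());
    buf.push(protocol_version);
    buf.extend_from_slice(&site_id.to_be_bytes());
    buf.extend_from_slice(&tcp_port.to_be_bytes());
    buf.extend_from_slice(&function_flags.to_be_bytes());
    buf.push(name.len() as u8);
    buf.extend_from_slice(name.as_bytes());
    buf
}

fn main() {
    // A relay that also archives sets both RELAY and SINK.
    let p = encode_payload(1, 0, 9000, FLAG_RELAY | FLAG_SINK, "v4l2:microscope");
    println!("{} payload bytes", p.len()); // 8 fixed bytes + 15 name bytes
    assert_eq!(u16::from_be_bytes([p[5], p[6]]), FLAG_RELAY | FLAG_SINK);
    assert_eq!(FLAG_SOURCE | FLAG_RELAY | FLAG_SINK | FLAG_CONTROLLER, 0x000F);
}
```

The fixed-size prefix is 8 bytes, so `payload_length` in the frame header is `8 + name_len`.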
### Behaviour

- Nodes send announcements periodically (default every 5 s) and immediately on startup via multicast
- No daemon — the node process itself sends and listens; no background service required
- On receiving an announcement the node records the peer (address, port, name, capabilities) and can initiate a transport connection if needed
- A peer that goes silent for `timeout_intervals × interval_ms` is considered offline and removed from the peer table
- Announcements are informational only — identity is validated at TCP connection time

#### Startup — new node joins the network

```mermaid
sequenceDiagram
    participant N as New Node
    participant MC as Multicast group
    participant A as Node A
    participant B as Node B

    N->>MC: announce (multicast)
    MC-->>A: receives announce
    MC-->>B: receives announce
    A->>N: announce (unicast reply)
    B->>N: announce (unicast reply)
    Note over N,B: All parties now know each other.<br/>Subsequent keepalives are multicast only.
```

Each node that hears a new peer sends a **unicast reply** directly to that peer. This allows the new node to populate its peer table within one round-trip rather than waiting up to `interval_ms` for other nodes' next scheduled broadcast.

#### Steady-state keepalive

```mermaid
sequenceDiagram
    participant A as Node A
    participant MC as Multicast group
    participant B as Node B
    participant C as Node C

    loop every interval_ms
        A->>MC: announce (multicast)
        MC-->>B: receives — updates last_seen_ms, no reply
        MC-->>C: receives — updates last_seen_ms, no reply
    end
```

Known peers update their `last_seen_ms` timestamp and do nothing else. No reply is sent, so there is no amplification.

#### Node loss — timeout

```mermaid
sequenceDiagram
    participant A as Node A
    participant B as Node B (offline)

    Note over B: Node B stops sending
    loop timeout_intervals × interval_ms elapses
        A->>A: check_timeouts() — not yet expired
    end
    A->>A: check_timeouts() — expired, remove B
    A->>A: on_peer_lost(B) callback
```

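
The keepalive and timeout bookkeeping can be sketched as a small peer table. This is a minimal sketch under assumed names (`PeerTable`, `on_announce`, `check_timeouts`), not the actual API; peers are keyed by `(addr, port)` as suggested by the restart discussion below.

```rust
use std::collections::HashMap;

struct Peer {
    name: String,
    last_seen_ms: u64,
}

struct PeerTable {
    peers: HashMap<(String, u16), Peer>, // keyed by (addr, port)
    interval_ms: u64,
    timeout_intervals: u64,
}

impl PeerTable {
    // Any announcement, first contact or keepalive, refreshes last_seen_ms.
    fn on_announce(&mut self, addr: &str, port: u16, name: &str, now_ms: u64) {
        self.peers
            .entry((addr.to_string(), port))
            .and_modify(|p| p.last_seen_ms = now_ms)
            .or_insert(Peer { name: name.to_string(), last_seen_ms: now_ms });
    }

    // Drop peers silent for longer than timeout_intervals × interval_ms
    // and return their names so on_peer_lost can be fired for each.
    fn check_timeouts(&mut self, now_ms: u64) -> Vec<String> {
        let budget = self.timeout_intervals * self.interval_ms;
        let expired: Vec<(String, u16)> = self
            .peers
            .iter()
            .filter(|(_, p)| now_ms.saturating_sub(p.last_seen_ms) > budget)
            .map(|(k, _)| k.clone())
            .collect();
        expired
            .into_iter()
            .map(|k| self.peers.remove(&k).unwrap().name)
            .collect()
    }
}

fn main() {
    let mut table = PeerTable {
        peers: HashMap::new(),
        interval_ms: 5_000,
        timeout_intervals: 3,
    };
    table.on_announce("10.0.0.2", 9000, "v4l2:microscope", 0);
    table.on_announce("10.0.0.2", 9000, "v4l2:microscope", 5_000); // keepalive
    assert!(table.check_timeouts(15_000).is_empty()); // 10 s silent, within 15 s budget
    for name in table.check_timeouts(25_000) {
        println!("on_peer_lost({name})"); // 20 s silent, past the budget
    }
}
```

With the defaults above (5 s interval, 3 intervals), a peer is declared lost after 15 s of silence.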
#### Node restart — known limitation
The current implementation attempts to detect a restart by checking whether `site_id` changed for a known `(addr, port)` entry. In practice this **does not work**: `site_id` is a static configuration value and will be the same before and after a restart. A restarted node will therefore simply be treated as a continuing keepalive and will not receive an immediate unicast reply — it will have to wait up to `interval_ms` for the next scheduled multicast broadcast from its peers.

```mermaid
sequenceDiagram
    participant R as Restarted Node
    participant MC as Multicast group
    participant A as Node A

    Note over R: Node restarts — same addr, port, site_id
    R->>MC: announce (multicast)
    MC-->>A: receives — site_id unchanged, treated as keepalive
    Note over A: No unicast reply sent. R waits up to interval_ms<br/>to learn about A via A's next scheduled multicast.
```

**What needs to change:** a **boot nonce** (random `u32` generated at startup, not configured) should be added to the announcement payload. A change in boot nonce for a known peer unambiguously signals a restart and triggers an immediate unicast reply. This requires a wire format version bump and updates to the peer table struct, announcement builder, and receive logic.
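
The proposed nonce check could look like the following. This is a sketch of the deferred design, not current behaviour; the names (`AnnounceKind`, `classify`) are illustrative.

```rust
use std::collections::HashMap;

// The boot nonce is a random u32 drawn once at process startup, never
// configured, so a nonce change for a known (addr, port) entry
// unambiguously signals a restart.

#[derive(Debug, PartialEq, Eq)]
enum AnnounceKind {
    NewPeer,   // reply with an immediate unicast announce
    Restart,   // nonce changed: also reply with an immediate unicast announce
    Keepalive, // just refresh last_seen_ms, no reply
}

fn classify(
    known: &mut HashMap<(String, u16), u32>, // (addr, port) -> last seen boot nonce
    addr: &str,
    port: u16,
    boot_nonce: u32,
) -> AnnounceKind {
    match known.insert((addr.to_string(), port), boot_nonce) {
        None => AnnounceKind::NewPeer,
        Some(prev) if prev != boot_nonce => AnnounceKind::Restart,
        Some(_) => AnnounceKind::Keepalive,
    }
}

fn main() {
    let mut known = HashMap::new();
    assert_eq!(classify(&mut known, "10.0.0.2", 9000, 0xCAFE), AnnounceKind::NewPeer);
    assert_eq!(classify(&mut known, "10.0.0.2", 9000, 0xCAFE), AnnounceKind::Keepalive);
    // Same addr, port, and site_id, but a fresh nonce exposes the restart.
    assert_eq!(classify(&mut known, "10.0.0.2", 9000, 0xBEEF), AnnounceKind::Restart);
    println!("restart detected via boot nonce");
}
```

Treating `Restart` exactly like `NewPeer` at the reply layer restores the one-round-trip rejoin that the startup sequence already provides.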
### No Avahi/Bonjour Dependency

The system does not link against, depend on, or interact with Avahi or Bonjour. It opens a raw UDP multicast socket directly, which requires only standard POSIX socket APIs. This keeps the runtime dependency footprint minimal and the behaviour predictable.

---

## Multi-Site (Forward Compatibility)

The immediate use case is a single LAN. A planned future use case is **site-to-site linking** — two independent networks (e.g. a lab and a remote location) connected by a tunnel (SSH port-forward, WireGuard, etc.), where nodes on both sites are reachable from either side.

### Site Identity

Every node carries a `site_id` (`u16`) in its announcement. In a single-site deployment this is always `0`. When sites are joined, each site is assigned a distinct non-zero ID; nodes retain their IDs across the join and are fully addressable by `(site_id, name)` from anywhere in the combined network.

This field is reserved from day one so that multi-site never requires a wire format change or a rename of existing identifiers.

### Site Gateway Node

A site gateway is a node that participates in both networks simultaneously — it has a connection on the local transport and a connection over the inter-site tunnel. It:

- Bridges discovery announcements between sites (rewriting `site_id` appropriately)
- Forwards encapsulated transport frames across the tunnel on behalf of cross-site edges
- Is itself a named node, so the control plane can see and reason about it

The tunnel transport is out of scope for now. The gateway is a node type, not a special infrastructure component — it uses the same wire protocol as everything else.

### Site ID Translation

Both sides of a site-to-site link will independently default to `site_id = 0`. A gateway cannot simply forward announcements across the boundary — every node on both sides would appear as site 0 and be indistinguishable.

The gateway is responsible for **site ID translation**: it assigns a distinct non-zero `site_id` to each side of the link and rewrites the `site_id` field in all announcements and any protocol messages that carry a `site_id` as they cross the boundary. From each side's perspective, remote nodes appear with the translated ID assigned by the gateway; local nodes retain their own IDs.

This means `site_id = 0` should be treated as "local / unassigned" and never forwarded across a site boundary without translation. A node that receives an announcement with `site_id = 0` on a cross-site link should treat it as a protocol error from the gateway.

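
The translation rule can be sketched as two checks at the gateway boundary. Illustrative only: in practice the gateway rewrites the field inside announcement frames rather than calling free functions like these, and `local_site` is whatever non-zero ID has been assigned to this side of the link.

```rust
// Rewrite the site_id of an announcement leaving the local side: nodes that
// still carry the "local / unassigned" ID 0 are stamped with this side's ID.
fn translate_outbound(site_id: u16, local_site: u16) -> u16 {
    debug_assert!(local_site != 0); // sides of a link always get non-zero IDs
    if site_id == 0 { local_site } else { site_id }
}

// Validate an announcement arriving over the tunnel: site_id = 0 must never
// cross a site boundary, so receiving it indicates a gateway protocol error.
fn validate_inbound(site_id: u16) -> Result<u16, &'static str> {
    if site_id == 0 {
        Err("protocol error: untranslated site_id 0 on cross-site link")
    } else {
        Ok(site_id)
    }
}

fn main() {
    assert_eq!(translate_outbound(0, 1), 1); // unassigned node stamped as site 1
    assert_eq!(translate_outbound(2, 1), 2); // already-assigned IDs pass through
    assert!(validate_inbound(0).is_err());
    println!("site 0 never crosses the boundary untranslated");
}
```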
### Addressing
A fully-qualified node address is `site_id:namespace:instance`. Within a single site, `site_id` is implicit and can be omitted. The control plane and discovery layer must store `site_id` alongside every peer record from the start, even if it is always `0`, so that the upgrade to multi-site addressing requires only configuration and a gateway node — not code changes.
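
The convention can be sketched as a hypothetical formatting helper (not an existing API):

```rust
// Format a fully-qualified address as site_id:namespace:instance, omitting
// the site prefix in the single-site (site_id = 0) case, per the convention above.
fn format_address(site_id: u16, name: &str) -> String {
    if site_id == 0 {
        name.to_string() // single site: site_id is implicit
    } else {
        format!("{site_id}:{name}")
    }
}

fn main() {
    println!("{}", format_address(0, "v4l2:microscope"));
    println!("{}", format_address(3, "v4l2:microscope"));
}
```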