# Node Discovery and Multi-Site

## Node Discovery
Standard mDNS (RFC 6762) uses UDP multicast over 224.0.0.251:5353 with DNS-SD service records. The wire protocol is well-defined and the multicast group is already in active use on most LANs. The standard service discovery stack (Avahi, Bonjour, nss-mdns) provides that transport but brings significant overhead: persistent daemons, D-Bus dependencies, complex configuration surface, and substantial resident memory. None of that is needed here.
The approach: reuse the multicast transport, define our own wire format.
Rather than DNS wire format, node announcements are encoded as binary frames using the same serialization layer (serial) and frame header used for video transport. A node joins the multicast group, broadcasts periodic announcements, and listens for announcements from peers.
### Announcement Frame

| Field | Size | Purpose |
|---|---|---|
| `message_type` | 2 bytes | Discovery message type (e.g. `0x0010` for node announcement) |
| `channel_id` | 2 bytes | Reserved / zero |
| `payload_length` | 4 bytes | Byte length of payload |
| Payload | variable | Encoded node identity and capabilities |
Payload fields:

| Field | Type | Purpose |
|---|---|---|
| `protocol_version` | u8 | Wire format version |
| `site_id` | u16 | Site this node belongs to (0 = local / unassigned) |
| `tcp_port` | u16 | Port where this node accepts transport connections |
| `function_flags` | u16 | Bitfield declaring node capabilities (see below) |
| `name_len` | u8 | Length of name string |
| `name` | bytes | Node name (`namespace:instance`, e.g. `v4l2:microscope`) |
`function_flags` bits:

| Bit | Mask | Meaning |
|---|---|---|
| 0 | `0x0001` | Source — produces video |
| 1 | `0x0002` | Relay — receives and distributes streams |
| 2 | `0x0004` | Sink — consumes video (display, archiver, etc.) |
| 3 | `0x0008` | Controller — participates in control plane coordination |
A node may set multiple bits — a relay that also archives sets both RELAY and SINK.
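The frame payload described above can be sketched as follows. This is illustrative, not the actual implementation: the real encoding is defined by the `serial` layer, and the big-endian byte order here is an assumption.

```rust
// Flag constants mirror the function_flags table above.
const FLAG_SOURCE: u16 = 0x0001;
const FLAG_RELAY: u16 = 0x0002;
const FLAG_SINK: u16 = 0x0004;
const FLAG_CONTROLLER: u16 = 0x0008;

#[derive(Debug, PartialEq)]
struct Announcement {
    protocol_version: u8,
    site_id: u16,
    tcp_port: u16,
    function_flags: u16,
    name: String, // namespace:instance
}

fn encode(a: &Announcement) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.push(a.protocol_version);
    buf.extend_from_slice(&a.site_id.to_be_bytes());
    buf.extend_from_slice(&a.tcp_port.to_be_bytes());
    buf.extend_from_slice(&a.function_flags.to_be_bytes());
    buf.push(a.name.len() as u8); // name_len: name must fit in 255 bytes
    buf.extend_from_slice(a.name.as_bytes());
    buf
}

fn decode(buf: &[u8]) -> Option<Announcement> {
    if buf.len() < 8 {
        return None; // fixed header of the payload is 8 bytes
    }
    let name_len = buf[7] as usize;
    if buf.len() < 8 + name_len {
        return None; // truncated name
    }
    Some(Announcement {
        protocol_version: buf[0],
        site_id: u16::from_be_bytes([buf[1], buf[2]]),
        tcp_port: u16::from_be_bytes([buf[3], buf[4]]),
        function_flags: u16::from_be_bytes([buf[5], buf[6]]),
        name: String::from_utf8(buf[8..8 + name_len].to_vec()).ok()?,
    })
}
```

A relay that also archives would set `FLAG_RELAY | FLAG_SINK` in `function_flags`, as described above.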
### Behaviour
- Nodes multicast announcements immediately on startup and periodically thereafter (default every 5 s)
- No daemon — the node process itself sends and listens; no background service required
- On receiving an announcement the node records the peer (address, port, name, capabilities) and can initiate a transport connection if needed
- A peer that goes silent for `timeout_intervals × interval_ms` is considered offline and removed from the peer table
- Announcements are informational only — identity is validated at TCP connection time
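The peer-table timeout behaviour can be sketched as below. The struct and method names (`PeerTable`, `check_timeouts`, `on_announce`) are illustrative assumptions; `now_ms` is passed in explicitly so the logic is testable.

```rust
use std::collections::HashMap;

struct Peer {
    last_seen_ms: u64,
    // addr, port, name, capabilities elided for brevity
}

struct PeerTable {
    peers: HashMap<String, Peer>,
    interval_ms: u64,
    timeout_intervals: u64,
}

impl PeerTable {
    /// Record or refresh a peer on any received announcement.
    fn on_announce(&mut self, name: &str, now_ms: u64) {
        self.peers
            .entry(name.to_string())
            .and_modify(|p| p.last_seen_ms = now_ms)
            .or_insert(Peer { last_seen_ms: now_ms });
    }

    /// Drop peers silent for longer than timeout_intervals × interval_ms;
    /// returns the names of the peers removed (where on_peer_lost would fire).
    fn check_timeouts(&mut self, now_ms: u64) -> Vec<String> {
        let deadline = self.timeout_intervals * self.interval_ms;
        let expired: Vec<String> = self
            .peers
            .iter()
            .filter(|(_, p)| now_ms.saturating_sub(p.last_seen_ms) > deadline)
            .map(|(name, _)| name.clone())
            .collect();
        for name in &expired {
            self.peers.remove(name);
        }
        expired
    }
}
```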
### Startup — new node joins the network

```mermaid
sequenceDiagram
    participant N as New Node
    participant MC as Multicast group
    participant A as Node A
    participant B as Node B
    N->>MC: announce (multicast)
    MC-->>A: receives announce
    MC-->>B: receives announce
    A->>N: announce (unicast reply)
    B->>N: announce (unicast reply)
    Note over N,B: All parties now know each other.<br/>Subsequent keepalives are multicast only.
```
Each node that hears a new peer sends a unicast reply directly to that peer. This allows the new node to populate its peer table within one round-trip rather than waiting up to interval_ms for other nodes' next scheduled broadcast.
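The reply decision on the receive path can be sketched as follows — reply by unicast only to a previously unknown peer, so steady-state keepalives generate no extra traffic. The names (`Action`, `on_announcement`) are illustrative, not from the implementation.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

enum Action {
    /// New peer: record it and send our own announcement back by unicast.
    ReplyUnicast(SocketAddr),
    /// Known peer: just refresh last_seen_ms, send nothing (no amplification).
    RefreshOnly,
}

fn on_announcement(known: &mut HashSet<SocketAddr>, from: SocketAddr) -> Action {
    // HashSet::insert returns true only if the peer was not already present.
    if known.insert(from) {
        Action::ReplyUnicast(from)
    } else {
        Action::RefreshOnly
    }
}
```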
### Steady-state keepalive

```mermaid
sequenceDiagram
    participant A as Node A
    participant MC as Multicast group
    participant B as Node B
    participant C as Node C
    loop every interval_ms
        A->>MC: announce (multicast)
        MC-->>B: receives — updates last_seen_ms, no reply
        MC-->>C: receives — updates last_seen_ms, no reply
    end
```
Known peers update their last_seen_ms timestamp and do nothing else. No reply is sent, so there is no amplification.
### Node loss — timeout

```mermaid
sequenceDiagram
    participant A as Node A
    participant B as Node B (offline)
    Note over B: Node B stops sending
    loop timeout_intervals × interval_ms elapses
        A->>A: check_timeouts() — not yet expired
    end
    A->>A: check_timeouts() — expired, remove B
    A->>A: on_peer_lost(B) callback
```
### Node restart — known limitation
The current implementation attempts to detect a restart by checking whether site_id changed for a known (addr, port) entry. In practice this does not work: site_id is a static configuration value and will be the same before and after a restart. A restarted node will therefore simply be treated as a continuing keepalive and will not receive an immediate unicast reply — it will have to wait up to interval_ms for the next scheduled multicast broadcast from its peers.
```mermaid
sequenceDiagram
    participant R as Restarted Node
    participant MC as Multicast group
    participant A as Node A
    Note over R: Node restarts — same addr, port, site_id
    R->>MC: announce (multicast)
    MC-->>A: receives — site_id unchanged, treated as keepalive
    Note over A: No unicast reply sent. R waits up to interval_ms<br/>to learn about A via A's next scheduled multicast.
```
What needs to change: a boot nonce (random u32 generated at startup, not configured) should be added to the announcement payload. A change in boot nonce for a known peer unambiguously signals a restart and triggers an immediate unicast reply. This requires a wire format version bump and updates to the peer table struct, announcement builder, and receive logic.
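The proposed receive logic can be sketched like this — restart detection keyed on the boot nonce rather than `site_id`. Names (`BootNonce`, `classify`, `Event`) are illustrative, not the actual peer table struct.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Random u32 generated once at process startup, carried in every announcement.
#[derive(Clone, Copy, PartialEq)]
struct BootNonce(u32);

enum Event {
    NewPeer,   // unknown (addr, port): reply by unicast
    Keepalive, // known peer, same nonce: refresh last_seen_ms only
    Restarted, // known peer, nonce changed: restart — reply by unicast
}

fn classify(
    table: &mut HashMap<SocketAddr, BootNonce>,
    from: SocketAddr,
    nonce: BootNonce,
) -> Event {
    // insert returns the previous nonce for a known (addr, port), if any.
    match table.insert(from, nonce) {
        None => Event::NewPeer,
        Some(prev) if prev != nonce => Event::Restarted,
        Some(_) => Event::Keepalive,
    }
}
```

Because the nonce is random rather than configured, a restarted node is distinguishable from a continuing keepalive even when addr, port, and `site_id` are all unchanged.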
## No Avahi/Bonjour Dependency
The system does not link against, depend on, or interact with Avahi or Bonjour. It opens a raw UDP multicast socket directly, which requires only standard POSIX socket APIs. This keeps the runtime dependency footprint minimal and the behaviour predictable.
## Multi-Site (Forward Compatibility)
The immediate use case is a single LAN. A planned future use case is site-to-site linking — two independent networks (e.g. a lab and a remote location) connected by a tunnel (SSH port-forward, WireGuard, etc.), where nodes on both sites are reachable from either side.
### Site Identity
Every node carries a site_id (u16) in its announcement. In a single-site deployment this is always 0. When sites are joined, each site is assigned a distinct non-zero ID; nodes retain their IDs across the join and are fully addressable by (site_id, name) from anywhere in the combined network.
This field is reserved from day one so that multi-site never requires a wire format change or a rename of existing identifiers.
### Site Gateway Node
A site gateway is a node that participates in both networks simultaneously — it has a connection on the local transport and a connection over the inter-site tunnel. It:
- Bridges discovery announcements between sites (rewriting `site_id` appropriately)
- Forwards encapsulated transport frames across the tunnel on behalf of cross-site edges
- Is itself a named node, so the control plane can see and reason about it
The tunnel transport is out of scope for now. The gateway is a node type, not a special infrastructure component — it uses the same wire protocol as everything else.
### Site ID Translation
Both sides of a site-to-site link will independently default to site_id = 0. A gateway cannot simply forward announcements across the boundary — every node on both sides would appear as site 0 and be indistinguishable.
The gateway is responsible for site ID translation: it assigns a distinct non-zero site_id to each side of the link and rewrites the site_id field in all announcements and any protocol messages that carry a site_id as they cross the boundary. From each side's perspective, remote nodes appear with the translated ID assigned by the gateway; local nodes retain their own IDs.
This means site_id = 0 should be treated as "local / unassigned" and never forwarded across a site boundary without translation. A node that receives an announcement with site_id = 0 on a cross-site link should treat it as a protocol error from the gateway.
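A minimal sketch of the two rules above, under the assumption that the gateway is configured with the non-zero id it has assigned to each side (`assigned_site` below is a hypothetical config value):

```rust
/// Outbound across the link: stamp unassigned announcements with the id the
/// gateway assigned to the originating side; pass already-qualified ids through.
fn translate_outgoing(site_id: u16, assigned_site: u16) -> u16 {
    if site_id == 0 { assigned_site } else { site_id }
}

/// Inbound from the link: site_id = 0 must never cross a site boundary, so
/// receiving it here is a protocol error on the gateway's part.
fn validate_incoming(site_id: u16) -> Result<u16, &'static str> {
    if site_id == 0 {
        Err("site_id 0 on cross-site link: gateway protocol error")
    } else {
        Ok(site_id)
    }
}
```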
### Addressing
A fully-qualified node address is site_id:namespace:instance. Within a single site, site_id is implicit and can be omitted. The control plane and discovery layer must store site_id alongside every peer record from the start, even if it is always 0, so that the upgrade to multi-site addressing requires only configuration and a gateway node — not code changes.
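Parsing a fully-qualified address can be sketched as below; the two-part form defaults to the implicit local site 0. `NodeAddr` and `parse_addr` are illustrative names, not the actual API.

```rust
#[derive(Debug, PartialEq)]
struct NodeAddr {
    site_id: u16,
    namespace: String,
    instance: String,
}

/// Accepts "namespace:instance" (site implicit, 0) or "site_id:namespace:instance".
fn parse_addr(s: &str) -> Option<NodeAddr> {
    let parts: Vec<&str> = s.split(':').collect();
    match parts.as_slice() {
        [ns, inst] => Some(NodeAddr {
            site_id: 0, // implicit local site
            namespace: ns.to_string(),
            instance: inst.to_string(),
        }),
        [site, ns, inst] => Some(NodeAddr {
            site_id: site.parse().ok()?,
            namespace: ns.to_string(),
            instance: inst.to_string(),
        }),
        _ => None,
    }
}
```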