docs: split architecture.md into focused sub-documents
architecture.md is now a concise overview (~155 lines) with a Documentation section linking to all sub-docs. New sub-docs in docs/:

- transport.md — wire modes, frame header, serialization, web peer
- relay.md — delivery modes, memory model, congestion, scheduler
- codec.md — stream metadata, format negotiation, codec backends
- xorg.md — screen grab, viewer sink, render loop, overlays
- discovery.md — multicast announcements, multi-site, site gateways
- node-state.md — wanted/current state, reconciler, stats, queries
- device-resilience.md — device loss handling, stream events, audio (future)

All cross-references updated to file links. Every sub-doc links back to architecture.md. docs/transport.md links to docs/protocol.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
62
docs/device-resilience.md
Normal file
@@ -0,0 +1,62 @@
# Device Resilience
See [Architecture Overview](../architecture.md).
Nodes that read from hardware devices (V4L2 cameras, media devices) must handle transient device loss — a USB camera that disconnects and reconnects, a device node that briefly disappears during a mode switch, or a stream that errors out and can be retried. This is not an early-implementation concern, but it has structural implications that should be respected from the start.
## The Problem by Layer
**Source node / device reader**
A device is opened by fd. On a transient disconnect, the fd becomes invalid — reads return errors or short counts. The device may reappear under the same path after some time. Recovery requires closing the bad fd, waiting or polling for the device to reappear, reopening, and restarting the capture loop. Any state tied to the old fd (ioctl configuration, stream-on status) must be re-established.
**Opaque stream edge**
The downstream receiver sees bytes stop. There is no mechanism in an opaque stream to distinguish "slow source", "dead source", or "recovered source". A reconnection produces a new byte stream that appears continuous to the receiver — but contains a hard discontinuity. The receiver has no way to know it should reset state. This is a known limitation of opaque mode. If the downstream consumer is sensitive to stream discontinuities (e.g. a frame parser), it must use encapsulated mode on that edge.
**Encapsulated stream edge**
The source node sends a `stream_event` message (`0x0004`) on the affected `channel_id` before the bytes stop (if possible), or as the first message when the stream resumes. The payload carries an event code:
| Code | Meaning |
|---|---|
| `0x01` | Stream interrupted — device lost, bytes will stop |
| `0x02` | Stream resumed — device recovered, frames will follow |
On receiving `stream_interrupted`, downstream nodes know to discard any partial frame being assembled and reset parser state. On `stream_resumed`, they know a clean frame boundary follows and can restart cleanly.
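As an illustration of this handshake, here is a minimal encode/decode sketch. The field layout shown (u16 message type, u16 `channel_id`, u32 payload length, big-endian, with the event code as a one-byte payload) is an assumption for the example; the authoritative frame header is defined in docs/protocol.md.

```python
import struct

MSG_STREAM_EVENT = 0x0004  # from the table above
EVT_INTERRUPTED = 0x01     # device lost, bytes will stop
EVT_RESUMED = 0x02         # device recovered, frames will follow

def encode_stream_event(channel_id: int, event_code: int) -> bytes:
    # Assumed layout: !H msg_type, !H channel_id, !I payload length,
    # then the payload (a single event-code byte).
    payload = struct.pack("!B", event_code)
    return struct.pack("!HHI", MSG_STREAM_EVENT, channel_id, len(payload)) + payload

def decode_stream_event(frame: bytes) -> tuple[int, int]:
    msg_type, channel_id, length = struct.unpack("!HHI", frame[:8])
    if msg_type != MSG_STREAM_EVENT or length != 1:
        raise ValueError("not a stream_event frame")
    (event_code,) = struct.unpack("!B", frame[8:9])
    return channel_id, event_code
```

A receiver would map `EVT_INTERRUPTED` to a parser reset and `EVT_RESUMED` to a clean restart, as described above.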
**Ingest module (MJPEG parser)**
The two-pass EOI state machine is stateful per stream. It must expose an explicit reset operation that discards any partial frame in progress and returns the parser to a clean initial state. This reset is triggered by a `stream_interrupted` event, or by any read error from the device. Any frame allocation begun for the discarded partial frame must be released before the reset completes.
**Frame allocator**
A partial frame that was being assembled when the device dropped must be explicitly abandoned. The allocator must support an `abandon` operation distinct from a normal `release` — abandon means the allocation is invalid and any reference tracking for it should be unwound immediately. This prevents a partial allocation from sitting in the accounting tables and consuming budget.
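A minimal sketch of that interface, with invented names (`alloc`, `release`, `abandon`) and a toy accounting table standing in for the real budget tracking:

```python
class FrameAllocator:
    """Budgeted frame allocator sketch (assumed interface)."""

    def __init__(self, budget: int):
        self.budget = budget
        self.used = 0
        self._live = {}   # handle -> size: the accounting table
        self._next = 0

    def alloc(self, size: int) -> int:
        if self.used + size > self.budget:
            raise MemoryError("frame budget exhausted")
        self._next += 1
        self._live[self._next] = size
        self.used += size
        return self._next

    def release(self, handle: int) -> None:
        # Normal path: a complete frame was consumed downstream.
        self.used -= self._live.pop(handle)

    def abandon(self, handle: int) -> None:
        # Device dropped mid-frame: the allocation is invalid and its
        # reference tracking is unwound immediately. Kept distinct from
        # release() so callers signal intent and the normal consumed-frame
        # path (refcounts, delivery stats) is skipped.
        self.used -= self._live.pop(handle)
```

The point of the distinct operation is that an abandoned allocation never reaches the consumed-frame bookkeeping, so it cannot linger in the accounting tables and eat budget.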
## Source Node Recovery Loop
The general structure for a resilient device reader (not yet implemented, for design awareness):
1. Open device, configure, start capture
2. On read error: emit `stream_interrupted` on the transport, close fd, enter retry loop
3. Poll for device reappearance (inotify on `/dev`, or timed retry)
4. On device back: reopen, reconfigure (ioctl state is lost), emit `stream_resumed`, resume capture
5. Log reconnection events to the control plane as observable signals
The retry loop must be bounded — a device that never returns should eventually cause the node to report a permanent failure rather than loop indefinitely.
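The loop above can be sketched as follows. Everything here is illustrative (nothing is implemented yet): `capture_once` stands in for the capture loop and returns on a read error, `emit_event` stands in for sending `stream_event` on the transport, and the timed-retry poll is the simple variant (the inotify variant would replace the sleep loop).

```python
import os
import time

def run_with_recovery(dev_path, capture_once, emit_event,
                      max_retries=10, retry_delay=0.5):
    recovered = False
    while True:
        # 1. Open and (re)configure: ioctl state is lost across a reopen.
        fd = os.open(dev_path, os.O_RDWR)
        try:
            if recovered:
                emit_event("stream_resumed")
            capture_once(fd)              # runs until a read error
        finally:
            os.close(fd)                  # 2. close the bad fd
        emit_event("stream_interrupted")
        # 3. Bounded poll for the device node to reappear.
        for _ in range(max_retries):
            time.sleep(retry_delay)
            if os.path.exists(dev_path):
                break
        else:
            # Device never returned: report permanent failure
            # instead of looping indefinitely.
            raise RuntimeError("device did not return")
        recovered = True                  # 4. reopen on next iteration
```

Step 5 (logging reconnection events to the control plane) would hang off the same `emit_event` points.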
## Implications for Opaque Streams
If a source node is producing an opaque stream and the device drops, the TCP connection itself may remain open while bytes stop flowing. The downstream node only learns something is wrong via a timeout or its own read error. For this reason, **opaque streams should only be used on edges where the downstream consumer either does not care about discontinuities or has its own out-of-band mechanism to detect them**. Edges into an ingest node must use encapsulated mode.
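For a consumer that must live with an opaque edge anyway, the out-of-band mechanism usually reduces to a read deadline. A sketch of that, with an illustrative timeout policy:

```python
import socket

def recv_or_stall(sock: socket.socket, stall_timeout: float):
    """Read from an opaque edge with a stall deadline.

    Returns bytes on data, b"" if the peer closed, or None on a
    stall. Opaque mode cannot distinguish slow, dead, or
    mid-recovery sources, so the timeout is the only signal.
    """
    sock.settimeout(stall_timeout)
    try:
        return sock.recv(65536)
    except socket.timeout:
        return None
```

What the consumer does on `None` (reconnect, reset, alarm) is its own policy; the transport offers no help on this edge, which is exactly why ingest edges must be encapsulated.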
---
## Audio (Future)
Audio streams are not in scope for the initial implementation, but the transport is designed to accommodate them without structural changes.
A future audio stream is just another message type on an existing transport connection — no new connection type or header field is needed. `stream_id` in the payload already handles multiplexing. The message type table has room for an `audio_frame` type alongside `video_frame`.
The main open question is codec and container: raw PCM is trivial to handle but large; compressed formats (Opus, AAC) need framing conventions. This is deferred until video is solid.
The frame allocator, relay, and archive modules should not assume that a frame implies video — they operate on opaque byte payloads with a message type and length, so audio frames will pass through the same infrastructure unchanged.
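A tiny sketch of what that agnosticism means in practice. The message type values here are hypothetical (only `stream_event` = `0x0004` is fixed above; the real table lives in docs/protocol.md):

```python
MSG_VIDEO_FRAME = 0x0001  # hypothetical value for illustration
MSG_AUDIO_FRAME = 0x0005  # hypothetical value for illustration

def route_frame(msg_type: int, stream_id: int, payload: bytes, sinks: dict) -> None:
    # Relay/archive-style handling: frames are routed by
    # (message type, stream_id) and the payload stays opaque,
    # so an audio frame takes exactly the same path as video.
    sinks.setdefault((msg_type, stream_id), []).append(payload)
```

Nothing in the routing inspects the bytes, which is the property the modules above need to preserve.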