From c58c211fee0eb8904a7360df44de20622ec7ee58 Mon Sep 17 00:00:00 2001
From: mikael-lovqvists-claude-agent
Date: Wed, 25 Mar 2026 22:37:22 +0000
Subject: [PATCH] Document device resilience and stream lifecycle signals

Adds stream_event message type (0x0004) with interrupted/resumed codes
for encapsulated edges. Documents per-layer implications: opaque stream
limitation, ingest parser reset requirement, frame allocator abandon
operation, and source node recovery loop structure.

Co-Authored-By: Claude Sonnet 4.6
---
 architecture.md | 55 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/architecture.md b/architecture.md
index b7acdea..dbea1c6 100644
--- a/architecture.md
+++ b/architecture.md
@@ -127,8 +127,9 @@ Header fields:
 | `0x0001` | Video frame |
 | `0x0002` | Control request |
 | `0x0003` | Control response |
+| `0x0004` | Stream event |
 
-Video frame payloads are raw compressed frames. Control payloads are binary-serialized structures — see [Protocol Serialization](#protocol-serialization).
+Video frame payloads are raw compressed frames. Control payloads are binary-serialized structures — see [Protocol Serialization](#protocol-serialization). Stream events carry lifecycle signals for a channel — see [Device Resilience](#device-resilience).
 
 ### Unified Control and Video on One Connection
 
@@ -158,6 +159,7 @@ Control messages are low-volume and can be interleaved with the video frame stre
 | Sequence numbers / timestamps | **no** | yes (via extension) |
 | Control / command channel | **no** | yes |
 | Remote device enumeration | **no** | yes |
+| Stream lifecycle signals | **no** | yes |
 
 The most important forcing function is **low-latency relay**: to drop a pending frame when a newer one arrives, the relay must know where frames begin and end. An opaque stream cannot support this, so any edge that requires low-latency output must use encapsulation.
 
@@ -240,6 +242,57 @@ graph TD
 
 ---
 
+## Device Resilience
+
+Nodes that read from hardware devices (V4L2 cameras, media devices) must handle transient device loss — a USB camera that disconnects and reconnects, a device node that briefly disappears during a mode switch, or a stream that errors out and can be retried. This is not an early implementation concern but has structural implications that should be respected from the start.
+
+### The Problem by Layer
+
+**Source node / device reader**
+
+A device is opened by fd. On a transient disconnect, the fd becomes invalid — reads return errors or short counts. The device may reappear under the same path after some time. Recovery requires closing the bad fd, waiting or polling for the device to reappear, reopening, and restarting the capture loop. Any state tied to the old fd (ioctl configuration, stream-on status) must be re-established.
+
+**Opaque stream edge**
+
+The downstream receiver sees bytes stop. There is no mechanism in an opaque stream to distinguish "slow source", "dead source", or "recovered source". A reconnection produces a new byte stream that appears continuous to the receiver — but contains a hard discontinuity. The receiver has no way to know it should reset state. This is a known limitation of opaque mode. If the downstream consumer is sensitive to stream discontinuities (e.g. a frame parser), it must use encapsulated mode on that edge.
+
+**Encapsulated stream edge**
+
+The source node sends a `stream_event` message (`0x0004`) on the affected `channel_id` before the bytes stop (if possible) or as the first message when stream resumes. The payload carries an event code:
+
+| Code | Meaning |
+|---|---|
+| `0x01` | Stream interrupted — device lost, bytes will stop |
+| `0x02` | Stream resumed — device recovered, frames will follow |
+
+On receiving `stream_interrupted`, downstream nodes know to discard any partial frame being assembled and reset parser state. On `stream_resumed`, they know a clean frame boundary follows and can restart cleanly.
+
+**Ingest module (MJPEG parser)**
+
+The two-pass EOI state machine is stateful per stream. It must expose an explicit reset operation that discards any partial frame in progress and returns the parser to a clean initial state. This reset is triggered by a `stream_interrupted` event, or by any read error from the device. Any frame allocation begun for the discarded partial frame must be released before the reset completes.
+
+**Frame allocator**
+
+A partial frame that was being assembled when the device dropped must be explicitly abandoned. The allocator must support an `abandon` operation distinct from a normal `release` — abandon means the allocation is invalid and any reference tracking for it should be unwound immediately. This prevents a partial allocation from sitting in the accounting tables and consuming budget.
+
+### Source Node Recovery Loop
+
+The general structure for a resilient device reader (not yet implemented, for design awareness):
+
+1. Open device, configure, start capture
+2. On read error: emit `stream_interrupted` on the transport, close fd, enter retry loop
+3. Poll for device reappearance (inotify on `/dev`, or timed retry)
+4. On device back: reopen, reconfigure (ioctl state is lost), emit `stream_resumed`, resume capture
+5. Log reconnection events to the control plane as observable signals
+
+The retry loop must be bounded — a device that never returns should eventually cause the node to report a permanent failure rather than loop indefinitely.
+
+### Implications for Opaque Streams
+
+If a source node is producing an opaque stream and the device drops, the TCP connection itself may remain open while bytes stop flowing. The downstream node only learns something is wrong via a timeout or its own read error. For this reason, **opaque streams should only be used on edges where the downstream consumer either does not care about discontinuities or has its own out-of-band mechanism to detect them**. Edges into an ingest node must use encapsulated mode.
+
+---
+
 ## Implementation Approach
 
 The system is built module by module in C11. Each translation unit is developed and validated independently before being integrated. See [planning.md](planning.md) for current status and module order, and [conventions.md](conventions.md) for code and project conventions.
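
Review note (not part of the patch): the receiver-side behaviour described for encapsulated edges could be sketched in C11 roughly as below. This is only an illustration under stated assumptions. The event codes `0x01` and `0x02` come from the table in the patch; everything else is hypothetical, including the names `parser_state`, `parser_reset`, and `handle_stream_event`, and the assumption that the event code is the first byte of the `stream_event` payload. The allocator's `abandon` operation is reduced to a flag and a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Event codes from the patch's table; the identifier names are hypothetical. */
enum stream_event_code {
    STREAM_EVENT_INTERRUPTED = 0x01,
    STREAM_EVENT_RESUMED     = 0x02
};

/* Hypothetical stand-in for the ingest parser's per-stream state. */
struct parser_state {
    size_t   partial_len;      /* bytes of the frame currently being assembled */
    bool     frame_allocated;  /* true while a partial frame allocation exists */
    bool     awaiting_resume;  /* true between interrupted and resumed */
    unsigned abandoned;        /* number of partial frames abandoned so far */
};

/* Abandon any partial allocation first, then return to a clean initial state. */
static void parser_reset(struct parser_state *p)
{
    assert(p != NULL);
    if (p->frame_allocated) {
        /* A real allocator would unwind its reference tracking here. */
        p->frame_allocated = false;
        p->abandoned++;
    }
    p->partial_len = 0;
}

/*
 * Handle one stream_event payload (assumed layout: code in byte 0).
 * Returns false for an empty payload or an unknown code.
 */
static bool handle_stream_event(struct parser_state *p,
                                const uint8_t *payload, size_t len)
{
    assert(p != NULL);
    if (payload == NULL || len < 1)
        return false;

    switch (payload[0]) {
    case STREAM_EVENT_INTERRUPTED:
        parser_reset(p);
        p->awaiting_resume = true;
        return true;
    case STREAM_EVENT_RESUMED:
        /* Reset again defensively in case the interrupted event was never sent. */
        parser_reset(p);
        p->awaiting_resume = false;
        return true;
    default:
        return false;
    }
}
```

Resetting on `STREAM_EVENT_RESUMED` as well reflects the patch's wording that the interrupted event is sent only "if possible", so a receiver may see a resume without a preceding interrupt.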