# Video Routing System — Architecture

## Concept

A graph-based multi-peer video routing system where nodes are media processes and edges are transport connections. The graph carries video streams between sources, relay nodes, and sinks, with **priority** as a first-class property on paths — so that a low-latency monitoring feed and a high-quality archival feed can coexist and be treated differently by the system.

---

## Design Rationale

### Get It on the Wire First

A key principle driving the architecture is that **capture devices should not be burdened with processing**. A Raspberry Pi attached to a camera (V4L2 source) is capable of pulling raw or MJPEG frames off the device, but it is likely too resource-constrained to also transcode, mux, or perform any non-trivial stream manipulation. Doing so would add latency and compete with the capture process itself.

The preferred model is:

1. **Pi captures and transmits raw** — reads frames directly from V4L2 (MJPEG or raw Bayer/YUV) and puts them on the wire over TCP as fast as possible, with no local transcoding
2. **A more capable machine receives and defines the stream** — a downstream node with proper CPU/GPU resources receives the raw feed and produces well-formed, containerized, or re-encoded output appropriate for the intended consumers (display, archive, relay)

This separation means the Pi's job is purely ingestion and forwarding. It keeps the capture loop tight and latency minimal. The downstream node then becomes the "source" of record for the rest of the graph.

This is also why the V4L2 remote control protocol is useful — the Pi doesn't need to run any control logic locally. It exposes its camera parameters over TCP, and the controlling machine adjusts exposure, white balance, codec settings, etc. remotely. The Pi just acts on the commands.

---

## Graph Model

### Nodes

Each node is a named process instance, identified by a namespace and name (e.g.
`v4l2:microscope`, `ffmpeg:ingest1`, `mpv:preview`, `archiver:main`).

Node types:

| Type | Role |
|---|---|
| **Source** | Produces video — V4L2 camera, screen grab, file, test signal |
| **Relay** | Receives one or more input streams and distributes to one or more outputs, each with its own delivery mode and buffer; never blocks upstream |
| **Sink** | Consumes video — display window, archiver, encoder output |

A relay with multiple inputs is what would traditionally be called a mux — it combines streams from several sources and forwards them, possibly over a single transport. The dispatch and buffering logic is the same regardless of input count.

### Edges

An edge is a transport connection between two nodes. Edges carry:

- The video stream itself (TCP, pipe, or other transport)
- A **priority** value
- A **transport mode** — opaque or encapsulated (see [Transport Protocol](#transport-protocol))

### Priority

Priority governs how the system allocates resources and makes trade-offs when paths compete:

- **High priority (low latency)** — frames are forwarded immediately; buffering is minimized; if a downstream node is slow it gets dropped frames, not delayed ones; quality may be lower
- **Low priority (archival)** — frames may be buffered; quality should be maximized; latency is acceptable; dropped frames are undesirable

Priority is a property of the *path*, not of the source. The same source can feed a high-priority monitoring path and a low-priority archival path simultaneously.

---

## Control Plane

There is no central hub or broker. Nodes communicate directly with each other over the binary transport.

Any node can hold the **controller role** (`function_flags` bit 3) — this means it has a user-facing interface (such as the web UI) through which the user can inspect the network, load a topology configuration, and establish or tear down connections between nodes.

The controller role is a capability, not a singleton.
Multiple nodes could hold it simultaneously; which one a user interacts with is a matter of which they connect to. A node that is purely a source or relay with no UI holds no controller bits.

The practical flow is: a user starts a node with the controller role and a web interface, discovers the other nodes on the network via the multicast announcement layer, and uses the UI to configure how streams are routed between them. The controller node issues connection instructions directly to the relevant peers over the binary protocol — there is no intermediary.

V4L2 device control and enumeration are carried as control messages within the encapsulated transport on the same connection as video — see [Transport Protocol](#transport-protocol).

---

## Ingestion Pipeline (Raspberry Pi Example)

```mermaid
graph LR
    CAM["V4L2 Camera<br/>/dev/video0"] -->|raw MJPEG| PI[Pi: ingest node]
    PI -->|encapsulated stream| RELAY[Relay]
    RELAY -->|high priority| DISPLAY["Display / Preview<br/>low latency"]
    RELAY -->|low priority| ARCHIVE["Archiver<br/>high quality"]
    CTRL["Controller node<br/>web UI"] -.->|"V4L2 control<br/>via transport"| PI
    CTRL -.->|connection config| RELAY
```

The Pi runs a node process that dequeues V4L2 buffers and forwards each buffer as an encapsulated frame over TCP. It also exposes the V4L2 control endpoint for remote parameter adjustment. Everything else happens on machines with adequate resources.

### V4L2 Buffer Dequeuing

When a V4L2 device is configured for `V4L2_PIX_FMT_MJPEG`, the driver delivers one complete MJPEG frame per dequeued buffer — frame boundaries are guaranteed at the source. The ingest module dequeues these buffers and emits each one as an encapsulated frame directly into the transport. No scanning or frame boundary detection is needed.

This is the primary capture path. It is clean, well-defined, and relies on standard V4L2 kernel behaviour rather than heuristics.

### Misbehaving Hardware: `mjpeg_scan` (Future)

Some hardware does not honour the per-buffer framing contract — cheap USB webcams or cameras with unusual firmware may concatenate multiple partial frames into a single buffer, or split one frame across multiple buffers. For these cases a separate optional `mjpeg_scan` module provides a fallback: it scans the incoming byte stream for JPEG SOI (`0xFF 0xD8`) and EOI (`0xFF 0xD9`) markers to recover frame boundaries heuristically.

This module is explicitly a workaround for non-compliant hardware. It is not part of the primary pipeline and will be implemented only if a specific device requires it. For sources with unusual container formats (AVI-wrapped MJPEG, HTTP multipart, RTSP with quirky packetisation), the preferred approach is to route through ffmpeg rather than write a custom parser.

---

## Transport Protocol

Transport between nodes operates in one of two modes. The choice is per-edge and has direct implications for what the relay on that edge can do.

### Opaque Binary Stream

The transport forwards bytes as they arrive with no understanding of frame boundaries. The relay acts as a pure byte pipe.
- Zero framing overhead
- Cannot drop frames (frame boundaries are unknown)
- Cannot multiplex multiple streams (no way to distinguish them)
- Cannot do per-frame accounting (byte budgets become byte-rate estimates only)
- Low-latency output is not available — the relay cannot discard a partial frame

This mode is appropriate for simple point-to-point forwarding where the consumer handles all framing, and where the relay has no need for frame-level intelligence.

### Frame-Encapsulated Stream

Each message is prefixed with a small fixed-size header. This applies to both video frames and control messages — the transport is unified.

Header fields:

| Field | Size | Purpose |
|---|---|---|
| `message_type` | 2 bytes | Determines how the payload is interpreted |
| `payload_length` | 4 bytes | Byte length of the following payload |

The header is intentionally minimal. Any node — including a relay that does not recognise a message type — can skip or forward the frame by reading exactly `payload_length` bytes without needing to understand the payload. All message-specific identifiers (stream ID, correlation ID, etc.) live inside the payload and are handled by the relevant message type handler.

**Message types and their payload structure:**

| Value | Type | Payload starts with |
|---|---|---|
| `0x0001` | Video frame | `stream_id` (u16), then compressed frame data |
| `0x0002` | Control request | `request_id` (u16), then command-specific fields |
| `0x0003` | Control response | `request_id` (u16), then result-specific fields |
| `0x0004` | Stream event | `stream_id` (u16), `event_code` (u8), then event-specific fields |

Node-level messages (not tied to any stream or request) have no prefix beyond the header — the payload begins with the message-specific fields directly.

Control payloads are binary-serialized structures — see [Protocol Serialization](#protocol-serialization). Stream events carry lifecycle signals — see [Device Resilience](#device-resilience).
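
The header's skip-unknown property is easy to see in code. A minimal sketch, with illustrative names and an assumed big-endian wire order (the header definition above does not specify byte order):

```c
#include <stdint.h>
#include <stddef.h>

#define MSG_HEADER_SIZE 6  /* message_type (2 bytes) + payload_length (4 bytes) */

/* Decoded header. Big-endian wire order is an assumption here. */
typedef struct {
    uint16_t message_type;   /* how the payload is interpreted */
    uint32_t payload_length; /* bytes of payload that follow */
} MsgHeader;

/* Parse a header from buf; returns 0 on success, -1 if buf is short. */
static int msg_header_decode(const uint8_t *buf, size_t len, MsgHeader *out)
{
    if (len < MSG_HEADER_SIZE) return -1;
    out->message_type   = (uint16_t)(((uint16_t)buf[0] << 8) | buf[1]);
    out->payload_length = ((uint32_t)buf[2] << 24) | ((uint32_t)buf[3] << 16) |
                          ((uint32_t)buf[4] << 8)  |  (uint32_t)buf[5];
    return 0;
}

/* Total bytes to consume in order to skip a message whose type is
 * unknown: the header plus exactly payload_length payload bytes. */
static size_t msg_total_size(const MsgHeader *h)
{
    return MSG_HEADER_SIZE + (size_t)h->payload_length;
}
```

A relay that receives a type it does not recognise decodes the header, then forwards or discards exactly `payload_length` bytes without inspecting them.
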
### Unified Control and Video on One Connection

By carrying control messages on the same transport as video frames, the system avoids managing separate connections per peer. A node that receives a video stream can be queried or commanded over the same socket.

This directly enables **remote device enumeration**: a connecting node can issue a control request asking what V4L2 devices the remote host exposes, and receive the list in a control response — before any video streams are established. Discovery and streaming share the same channel.

The V4L2 control operations map naturally to control request/response pairs:

| Operation | Direction |
|---|---|
| Enumerate devices | request → response |
| Get device controls (parameters, ranges, menus) | request → response |
| Get control values | request → response |
| Set control values | request → response (ack/fail) |

Control messages are low-volume and can be interleaved with the video frame stream without meaningful overhead.

### Capability Implications

| Feature | Opaque | Encapsulated |
|---|---|---|
| Simple forwarding | yes | yes |
| Low-latency drop | **no** | yes |
| Per-frame byte accounting | **no** | yes |
| Multi-stream over one transport | **no** | yes |
| Sequence numbers / timestamps | **no** | yes (via extension) |
| Control / command channel | **no** | yes |
| Remote device enumeration | **no** | yes |
| Stream lifecycle signals | **no** | yes |

The most important forcing function is **low-latency relay**: to drop a pending frame when a newer one arrives, the relay must know where frames begin and end. An opaque stream cannot support this, so any edge that requires low-latency output must use encapsulation.

Opaque streams are a valid optimization for leaf edges where the downstream consumer (e.g. an archiver writing raw bytes to disk) does its own framing, requires no relay intelligence, and has no need for remote control.

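
As an illustration of how one of these request/response pairs might look on the wire, the following builds an enumerate-devices control request. The command code `CTRL_CMD_ENUM_DEVICES` and the big-endian byte order are assumptions; the document fixes the operation set but not the numeric encoding:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical command code; not defined by the protocol tables. */
#define CTRL_CMD_ENUM_DEVICES 0x0001

/* Build a control request (message_type 0x0002): a 6-byte header,
 * then request_id (u16) and a command code (u16) as the payload.
 * Returns bytes written, or 0 if the buffer is too small. */
static size_t ctrl_request_encode(uint8_t *buf, size_t cap,
                                  uint16_t request_id, uint16_t command)
{
    const uint32_t payload_len = 4; /* request_id + command */
    if (cap < 6 + payload_len) return 0;
    buf[0] = 0x00; buf[1] = 0x02;                 /* message_type = control request */
    buf[2] = 0; buf[3] = 0; buf[4] = 0;           /* payload_length, big-endian */
    buf[5] = (uint8_t)payload_len;
    buf[6] = (uint8_t)(request_id >> 8);          /* correlation id for the */
    buf[7] = (uint8_t)request_id;                 /* matching 0x0003 response */
    buf[8] = (uint8_t)(command >> 8);
    buf[9] = (uint8_t)command;
    return 6 + payload_len;
}
```

The `request_id` is what lets the response (`0x0003`) be matched to its request when control traffic is interleaved with video frames on the same socket.
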

---

## Relay Design

A relay receives frames from one or more upstream sources and distributes them to any number of outputs. Each output is independently configured with a **delivery mode** that determines how it handles the tension between latency and completeness.

### Output Delivery Modes

**Low-latency mode** — minimize delay, accept loss

The output holds at most one pending frame. When a new frame arrives:

- If the slot is empty, the frame occupies it and is sent as soon as the transport allows
- If the slot is already occupied (transport not ready), the pending frame is dropped and replaced by the incoming one — the pending frame is already stale

The consumer always receives the most recent frame the transport could deliver. Frame loss is expected and acceptable.

**Completeness mode** — minimize loss, accept delay

The output maintains a queue. When a new frame arrives it is enqueued. The transport drains the queue in order. When the queue is full, a drop policy is applied — either drop the oldest frame (preserve recency) or drop the newest (preserve continuity). Which policy fits depends on the consumer: an archiver may prefer continuity; a scrubber may prefer recency.

### Memory Model

Compressed frames have variable sizes (I-frames vs P-frames, quality settings, scene complexity), so fixed-slot buffers waste memory unpredictably. The preferred model is **per-frame allocation** with explicit bookkeeping. Each allocated frame is tracked with at minimum:

- Byte size
- Sequence number or timestamp
- Which outputs still hold a reference

Limits are enforced per output independently — not as a shared pool — so a slow completeness output cannot starve a low-latency output or exhaust global memory. Per-output limits have two axes:

- **Frame count** — cap on number of queued frames
- **Byte budget** — cap on total bytes in flight for that output

Both limits should be configurable. Either limit being reached triggers the drop policy.

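
The single-slot rule can be sketched in a few lines. This is illustrative, not the relay's actual implementation; in particular, a real output would hold a reference-counted frame allocation, not a borrowed pointer:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Sketch of the single-slot low-latency output. */
typedef struct {
    const uint8_t *pending;  /* NULL when the slot is empty */
    size_t pending_len;
    uint64_t drops;          /* observable congestion signal */
} LLOutput;

/* Offer a frame to the slot. The newest frame always wins: if a stale
 * frame is still pending, it is dropped and replaced. Returns false
 * when a frame was lost in the exchange. */
static bool ll_output_offer(LLOutput *o, const uint8_t *frame, size_t len)
{
    bool replaced = (o->pending != NULL);
    if (replaced) o->drops++;        /* stale pending frame discarded */
    o->pending = frame;
    o->pending_len = len;
    return !replaced;
}

/* Called when the transport becomes ready: take the pending frame. */
static const uint8_t *ll_output_take(LLOutput *o, size_t *len)
{
    const uint8_t *f = o->pending;
    if (len) *len = o->pending_len;
    o->pending = NULL;
    return f;
}
```

Whatever the transport eventually takes is always the newest frame the output has seen, which is exactly the low-latency invariant.
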
### Congestion: Two Sides

Congestion can arise at both ends of the relay and must be handled explicitly on each.

**Inbound congestion (upstream → relay)**

If the upstream source produces frames faster than any output can dispatch them:

- Low-latency outputs are unaffected by design — they always hold at most one frame
- Completeness outputs will see their queues grow; limits and drop policy absorb the excess

The relay never signals backpressure to the upstream. It is the upstream's concern to produce frames at a sustainable rate; the relay's concern is only to handle whatever arrives without blocking.

**Outbound congestion (relay → downstream transport)**

If the transport layer cannot accept a frame immediately:

- Low-latency mode: the pending frame is dropped when the next frame arrives; the transport sends the newest frame it can when it becomes ready
- Completeness mode: the frame stays in the queue; the queue grows until the transport catches up or limits are reached

The interaction between outbound congestion and the byte budget is important: a transport that is consistently slow will fill the completeness queue to its byte budget limit, at which point the drop policy engages. This is the intended safety valve — the budget defines the maximum acceptable latency inflation before the system reverts to dropping.

### Congestion Signals

Even though the relay does not apply backpressure, it should emit **observable congestion signals** — drop counts, queue depth, byte utilization — on the control plane so that the controller can make decisions: reduce upstream quality, reroute, alert, or adjust budgets dynamically.

### Multi-Input Scheduling

When a relay has multiple input sources feeding the same output, it needs a policy for which source's frame to forward next when the link is under pressure or when frames from multiple sources are ready simultaneously. This policy is the **scheduler**.
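
The byte-budget safety valve described under outbound congestion reduces to a small amount of bookkeeping. A sketch of a drop-oldest completeness queue, with illustrative names (a real queue stores frame references, not just sizes, and assumes `max_frames <= Q_CAP`):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define Q_CAP 64

/* Completeness output capped by frame count and byte budget. */
typedef struct {
    size_t sizes[Q_CAP];
    size_t head, count;
    size_t bytes;        /* total bytes in flight for this output */
    size_t max_frames;   /* frame-count limit */
    size_t max_bytes;    /* byte budget */
    uint64_t drops;      /* observable congestion signal */
} BudgetQueue;

static void bq_drop_oldest(BudgetQueue *q)
{
    q->bytes -= q->sizes[q->head];
    q->head = (q->head + 1) % Q_CAP;
    q->count--;
    q->drops++;
}

/* Enqueue a frame of len bytes; returns false only if it can never fit.
 * When either limit would be exceeded, the drop-oldest policy runs
 * until the new frame fits: the budget acting as a safety valve. */
static bool bq_push(BudgetQueue *q, size_t len)
{
    if (len > q->max_bytes || q->max_frames == 0) return false;
    while (q->count >= q->max_frames || q->bytes + len > q->max_bytes)
        bq_drop_oldest(q);
    q->sizes[(q->head + q->count) % Q_CAP] = len;
    q->count++;
    q->bytes += len;
    return true;
}
```

The `drops` counter is exactly the kind of value the congestion-signal section below expects to surface on the control plane.
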
The scheduler is a separate concern from delivery mode (low-latency vs completeness) — delivery mode governs buffering and drop behaviour per output; the scheduler governs which input is served when multiple compete.

Candidate policies (not exhaustive — the design should keep the scheduler pluggable):

| Policy | Behaviour |
|---|---|
| **Strict priority** | Always prefer the highest-priority source; lower-priority sources are only forwarded when no higher-priority frame is pending |
| **Round-robin** | Cycle evenly across all active inputs — one frame from each in turn |
| **Weighted round-robin** | Each input has a weight; forwarding interleaves at the given ratio (e.g. 1:3 means one frame from source A per three from source B) |
| **Deficit round-robin** | Byte-fair rather than frame-fair variant of weighted round-robin; useful when sources have very different frame sizes |
| **Source suppression** | A congested or degraded link simply stops forwarding from a given input entirely until conditions improve |

Priority remains a property of the path (set at connection time). The scheduler uses those priorities plus runtime state (queue depths, drop rates) to make per-frame decisions. The `relay` module should expose a scheduler interface so policies are interchangeable without touching routing logic. Which policies to implement first is an open question — see [Open Questions](#open-questions).

```mermaid
graph TD
    UP1[Upstream Source A] -->|encapsulated stream| RELAY[Relay]
    UP2[Upstream Source B] -->|encapsulated stream| RELAY
    RELAY --> LS["Low-latency Output<br/>single-slot<br/>drop on collision"]
    RELAY --> CS["Completeness Output<br/>queued<br/>drop on budget exceeded"]
    RELAY --> OB["Opaque Output<br/>byte pipe<br/>no frame awareness"]
    LS -->|encapsulated| LC["Low-latency Consumer<br/>eg. preview display"]
    CS -->|encapsulated| CC["Completeness Consumer<br/>eg. archiver"]
    OB -->|opaque| RAW["Raw Consumer<br/>eg. disk writer"]
    RELAY -.->|"drop count<br/>queue depth<br/>byte utilization"| CTRL[Controller node]
```

---

## Codec Module

A `codec` module provides per-frame encode and decode operations for pixel data. It sits between raw pixel buffers and the transport — sources call encode before sending, sinks call decode after receiving. The relay and transport layers never need to understand pixel formats; they carry opaque payloads.

### Stream Metadata

Receivers must know what format a frame payload is in before they can decode it. This is communicated once at stream setup via a `stream_open` control message rather than tagging every frame header. The message carries three fields:

**`format` (u16)** — the wire format of the payload bytes; determines how the receiver decodes the frame:

| Value | Format |
|---|---|
| `0x0001` | MJPEG |
| `0x0002` | H.264 |
| `0x0003` | H.265 / HEVC |
| `0x0004` | AV1 |
| `0x0005` | FFV1 |
| `0x0006` | ProRes |
| `0x0007` | QOI |
| `0x0008` | Raw pixels (see `pixel_format`) |
| `0x0009` | Raw pixels + ZSTD (see `pixel_format`) |

**`pixel_format` (u16)** — pixel layout for raw formats; zero and ignored for compressed formats:

| Value | Layout |
|---|---|
| `0x0001` | BGRA 8:8:8:8 |
| `0x0002` | RGBA 8:8:8:8 |
| `0x0003` | BGR 8:8:8 |
| `0x0004` | YUV 4:2:0 planar |
| `0x0005` | YUV 4:2:2 packed |

**`origin` (u16)** — how the frame was produced; informational only, does not affect decoding; useful for diagnostics, quality inference, and routing decisions:

| Value | Origin |
|---|---|
| `0x0001` | Device native — camera or capture card encoded it directly |
| `0x0002` | libjpeg-turbo |
| `0x0003` | ffmpeg (libavcodec) |
| `0x0004` | ffmpeg (subprocess) |
| `0x0005` | VA-API direct |
| `0x0006` | NVENC direct |
| `0x0007` | Software (other) |

A V4L2 camera outputting MJPEG has `format=MJPEG, origin=device_native`. The same format re-encoded in process has `format=MJPEG, origin=libjpeg-turbo`.
The receiver decodes both identically; the distinction is available for logging and diagnostics without polluting the format identifier.

### Format Negotiation

When a source node opens a stream channel it sends a `stream_open` control message that includes the codec identifier. The receiver can reject the codec if it has no decoder for it. This keeps codec knowledge at the edges — relay nodes are unaffected.

### libjpeg-turbo

JPEG is the natural first codec: libjpeg-turbo provides SIMD-accelerated encode on both x86 and ARM, the output format is identical to what V4L2 cameras already produce (so the ingest and archive paths treat them the same), and it is universally decodable, including in browsers via `<img>` elements or `createImageBitmap`. Lossy, but quality is configurable.

### QOI

QOI (Quite OK Image Format) is a strong candidate for lossless screen grabs: it encodes and decodes in a single pass with no external dependencies, performs well on content with large uniform regions (UIs, text, diagrams), and the reference implementation is a single `.h` file. Output is larger than JPEG but decode is simpler and there is no quality loss. Worth benchmarking against JPEG at high quality settings for screen content.

### ZSTD over Raw Pixels

ZSTD at compression level 1 is extremely fast and can achieve meaningful ratios on screen content (which tends to be repetitive). No pixel format conversion is needed — capture raw, compress raw, decompress raw, display raw. This avoids any colour space or chroma subsampling decisions and is entirely lossless. The downside is that even compressed, the payload is larger than JPEG for photographic content; for UI-heavy screens it can be competitive.

### VA-API (Hardware H.264 Intra)

Intra-only H.264 via VA-API gives very high compression with GPU offload. This is the most complex option to set up and introduces a GPU dependency, but may be worthwhile for high-resolution grabs over constrained links. Deferred until simpler codecs are validated.

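
Receiver-side format negotiation then reduces to a membership test over the `format` table. A minimal sketch, assuming a hypothetical node built with only the lighter decoders discussed above (the function name is illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Wire-format identifiers from the stream metadata table. */
enum {
    FMT_MJPEG    = 0x0001,
    FMT_H264     = 0x0002,
    FMT_H265     = 0x0003,
    FMT_AV1      = 0x0004,
    FMT_FFV1     = 0x0005,
    FMT_PRORES   = 0x0006,
    FMT_QOI      = 0x0007,
    FMT_RAW      = 0x0008,
    FMT_RAW_ZSTD = 0x0009
};

/* Accept a stream_open only for formats this node can decode.
 * The decoder set shown is an assumption: a node built with just
 * libjpeg-turbo, zstd, and the QOI reference decoder. */
static bool can_decode(uint16_t format)
{
    switch (format) {
    case FMT_MJPEG:     /* libjpeg-turbo */
    case FMT_QOI:       /* single-header reference decoder */
    case FMT_RAW:       /* no decode needed */
    case FMT_RAW_ZSTD:  /* zstd decompress, then raw */
        return true;
    default:
        return false;   /* reject; codec knowledge stays at the edges */
    }
}
```

Relays never call this: they forward payloads opaquely, so only the terminating sink needs a decoder for the negotiated format.
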
### ffmpeg Backend

ffmpeg (via libavcodec or subprocess) is a practical escape hatch that gives access to a large number of codecs, container formats, and hardware acceleration paths without implementing them from scratch. It is particularly useful for archival formats where the encode latency of a more complex codec is acceptable.

**Integration options:**

- **libavcodec** — link directly against the library; programmatic API, tight integration, same process; introduces a large build dependency but gives full control over codec parameters and hardware acceleration (NVENC, VA-API, VideoToolbox, etc.)
- **subprocess pipe** — spawn `ffmpeg`, pipe raw frames to stdin, read encoded output from stdout; simpler, no build dependency, more isolated from the rest of the node process; latency is higher due to process overhead but acceptable for archival paths where real-time delivery is not required

The subprocess approach fits naturally into the completeness output path of the relay: frames arrive in order, there is no real-time drop pressure, and the ffmpeg process can be restarted independently if it crashes without taking down the node. libavcodec is the better fit for low-latency encoding (e.g. screen grab over a constrained link).

**Archival formats of interest:**

| Format | Notes |
|---|---|
| H.265 / HEVC | ~50% better compression than H.264 at same quality; NVENC and VA-API hardware support widely available |
| AV1 | Best open-format compression; software encode is slow, hardware encode (AV1 NVENC on RTX 30+) is fast |
| FFV1 | Lossless, designed for archival; good compression for video content; the format used by film archives |
| ProRes | Near-lossless, widely accepted in post-production toolchains; large files but easy to edit downstream |

The encoder backend is recorded in the `origin` field of `stream_open` — the receiver cares only about `format`, not how the bytes were produced.
Switching from a subprocess encode to libavcodec, or from software to hardware, requires no protocol change.

---

## X11 / Xorg Integration

An `xorg` module provides two capabilities that complement the V4L2 camera pipeline: screen geometry queries and an X11-based video feed viewer. Both operate as first-class node roles.

### Screen Geometry Queries (XRandR)

Using the XRandR extension, the module can enumerate connected outputs and retrieve their geometry — resolution, position within the desktop coordinate space, physical size, and refresh rate. This is useful for:

- **Routing decisions**: knowing the resolution of the target display before deciding how to scale or crop an incoming stream
- **Screen grab source**: determining the exact rectangle to capture for a given monitor
- **Multi-monitor layouts**: placing viewer windows correctly in a multi-head setup without guessing offsets

Queries are exposed as control request/response pairs on the standard transport, so a remote node can ask "what monitors does this machine have?" and receive structured geometry data without any X11 code on the asking side.

### Screen Grab Source

The module can act as a video source by capturing the contents of a screen region using `XShmGetImage` (MIT-SHM extension) for zero-copy capture within the same machine. The captured region is a configurable rectangle — typically one full monitor by its XRandR geometry, but can be any sub-region.

Raw captured pixels are uncompressed — 1920×1080 at 32 bpp is ~8 MB per frame. Before the frame enters the transport it must be encoded. The grab loop calls the `codec` module to compress each frame, then encapsulates the result. The codec is configured per stream; see [Codec Module](#codec-module).

The grab loop produces frames at a configured rate, encapsulates them, and feeds them into the transport like any other video source.

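
The ~8 MB figure is plain arithmetic (width × height × bytes per pixel), but making it a helper keeps the budget math out of the grab loop. Names are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Bytes in one uncompressed frame. */
static size_t raw_frame_bytes(size_t width, size_t height,
                              size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Uncompressed bytes per second at a given frame rate: the figure
 * that motivates mandatory encoding before the transport. */
static uint64_t raw_stream_rate(size_t width, size_t height,
                                size_t bytes_per_pixel, size_t fps)
{
    return (uint64_t)raw_frame_bytes(width, height, bytes_per_pixel) * fps;
}
```

For 1920×1080 BGRA this gives 8,294,400 bytes per frame, and at 30 fps roughly 249 MB/s, which is why the grab loop always encodes before a frame touches the transport.
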
Combined with geometry queries, a remote controller can enumerate monitors, select one, and start a screen grab stream without manual coordinate configuration.

### Frame Viewer Sink

The module can act as a video sink by creating a window and rendering the latest received frame into it. The window:

- Geometry (size and monitor placement) is specified at stream open time, using XRandR data when targeting a specific output
- Can be made fullscreen on a chosen output
- Displays the most recently received frame — driven by the low-latency output mode of the relay; never buffers for completeness
- Forwards keyboard and mouse events back upstream as `INPUT_EVENT` protocol messages, enabling remote control use cases

Scale and crop are applied in the renderer. Four display modes are supported (selected per viewer):

| Mode | Behaviour |
|---|---|
| `STRETCH` | Fill window, ignore aspect ratio |
| `FIT` | Largest rect that fits, preserve aspect, black bars |
| `FILL` | Scale to cover, preserve aspect, crop edges |
| `1:1` | Native pixel size, no scaling; excess cropped |

Each mode combines with an anchor (`CENTER` or `TOP_LEFT`) that controls placement when the frame does not fill the window exactly. This allows a high-resolution source (Pi camera, screen grab) to be displayed scaled-down on a different machine, or viewed at native resolution with panning.

This makes it the display-side counterpart of the V4L2 capture source: a frame grabbed from a camera on a Pi can be viewed on any machine in the network running a viewer sink node, with the relay handling the path and delivery mode.

#### Renderer: GLFW + OpenGL

The initial implementation uses **GLFW** for window and input management and **OpenGL** for rendering. GLFW handles window creation, the event loop, resize, and input callbacks — it also supports Vulkan surface creation using the same API, which makes a future renderer swap straightforward.

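
The `FIT` and `FILL` geometry reduces to comparing aspect ratios in integer math. A sketch with `CENTER` anchoring; the type and function names are illustrative, not the viewer's API:

```c
/* Destination rectangle in window coordinates; for FILL, x or y can
 * go negative, meaning the edges are cropped. */
typedef struct { int x, y, w, h; } Rect;

/* FIT: largest rect that fits inside (ww, wh), preserving aspect. */
static Rect place_fit(int fw, int fh, int ww, int wh)
{
    Rect r;
    if ((long)fw * wh <= (long)ww * fh) {        /* height-limited */
        r.h = wh; r.w = (int)((long)fw * wh / fh);
    } else {                                      /* width-limited */
        r.w = ww; r.h = (int)((long)fh * ww / fw);
    }
    r.x = (ww - r.w) / 2;                         /* CENTER anchor: */
    r.y = (wh - r.h) / 2;                         /* black bars split evenly */
    return r;
}

/* FILL: smallest rect that covers (ww, wh), preserving aspect. */
static Rect place_fill(int fw, int fh, int ww, int wh)
{
    Rect r;
    if ((long)fw * wh <= (long)ww * fh) {        /* width must stretch */
        r.w = ww; r.h = (int)((long)fh * ww / fw);
    } else {                                      /* height must stretch */
        r.h = wh; r.w = (int)((long)fw * wh / fh);
    }
    r.x = (ww - r.w) / 2;                         /* CENTER anchor: */
    r.y = (wh - r.h) / 2;                         /* overflow cropped evenly */
    return r;
}
```

`STRETCH` is simply the whole window and `1:1` is the frame's native size, so only these two modes need any arithmetic.
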
Input events (keyboard, mouse) are normalised by GLFW before being encoded as protocol messages.

The OpenGL renderer:

1. For **MJPEG**: calls `tjDecompressToYUVPlanes` (libjpeg-turbo) to decompress directly to planar YUV — no CPU-side color conversion. JPEG stores YCbCr internally so this is the minimal decode path: Huffman + DCT output lands directly in YUV planes.
2. Uploads Y, Cb, Cr as separate `GL_RED` textures (chroma at half resolution for 4:2:0 / 4:2:2 as delivered by most V4L2 cameras).
3. Fragment shader samples the three planes and applies the BT.601 matrix to produce RGB — a few lines of GLSL.
4. Scaling and filtering happen in the same shader pass.
5. Presents via GLFW's swap-buffers call.

For **raw pixel formats** (BGRA, YUV planar from the wire): uploaded directly without decode; the shader handles any necessary swizzle or conversion.

This keeps CPU load minimal — the only CPU work for MJPEG is Huffman decode and DCT, which libjpeg-turbo runs with SIMD. All color conversion and scaling is on the GPU.

#### Text overlays

Two tiers, implemented in order:

**Tier 1 — bitmap font atlas (done)**

`tools/gen_font_atlas/gen_font_atlas.py` (Python/Pillow) renders glyphs 32–255 from DejaVu Sans at 16pt into a packed grayscale atlas using a skyline bin packer and emits `build/gen/font_atlas.h` — a C header with the pixel data as a `static const uint8_t` array and a `Font_Glyph[256]` metrics table indexed by codepoint.

At runtime the atlas is uploaded as a `GL_R8` texture. Each overlay is rendered as a batch of alpha-blended glyph quads preceded by a semi-transparent dark background rect (using a separate minimal screen-space rect shader driven by `gl_VertexID`). The public API is `xorg_viewer_set_overlay_text(v, idx, x, y, text, r, g, b)` and `xorg_viewer_clear_overlays(v)`. Up to 8 independent overlays are supported.

The generator runs automatically as a `make` dependency before compiling `xorg.c`.
The Pillow build tool is the only Python dependency; there are no runtime font deps.

**Tier 2 — HarfBuzz + FreeType (future)**

A proper runtime font stack for full typography: correct shaping, kerning, ligatures, bidirectional text, non-Latin scripts. Added as a feature flag with its own runtime deps alongside the blit path. When Tier 2 is implemented, the Pillow build dependency may be replaced by a purpose-built atlas generator (removing the Python dep entirely), if the blit path is still useful alongside the full shaping path.

#### Render loop

The viewer is driven by incoming frames rather than a fixed-rate loop. Two polling functions are provided depending on the use case:

**Static image / test tool** — `xorg_viewer_poll(v)` processes events then re-renders from existing textures:

```c
while (xorg_viewer_poll(v)) {
    /* wait for close */
}
```

**Live stream** — the push functions (`push_yuv420`, `push_mjpeg`, etc.) already upload and render. Use `xorg_viewer_handle_events(v)` to process window events without an extra render:

```c
while (1) {
    /* block on V4L2/network fd until frame or timeout */
    if (frame_available) {
        xorg_viewer_push_mjpeg(v, data, size); /* upload + render */
    }
    if (!xorg_viewer_handle_events(v)) {
        break;
    }
}
```

A `framebuffer_size_callback` registered on the window calls `render()` synchronously during resize, so the image tracks the window edge without a one-frame lag.

Threading note: the GL context must be used from the thread that created it. In the video node, incoming frames arrive on a network receive thread. A frame queue between the receive thread and the render thread (which owns the GL context) is the correct model — the render thread drains the queue each poll iteration rather than having the network thread call push functions directly.

#### Renderer: Vulkan (future alternative)

A Vulkan renderer is planned as an alternative to the OpenGL one.
GLFW's surface creation API is renderer-agnostic, so the window management and input handling code is shared. Only the renderer backend changes.

Vulkan offers more explicit control over presentation timing, multi-queue workloads, and compute shaders (e.g. on-GPU MJPEG decode via a compute pass if a suitable library is available). It is not needed for the initial viewer but worth having for high-frame-rate or multi-stream display scenarios.

The renderer selection should be a compile-time or runtime option — both implementations conform to the same internal interface (`render_frame(pixel_buffer, width, height, format)`).

---

## Audio (Future)

Audio streams are not in scope for the initial implementation but the transport is designed to accommodate them without structural changes. A future audio stream is just another message type on an existing transport connection — no new connection type or header field is needed. `stream_id` in the payload already handles multiplexing. The message type table has room for an `audio_frame` type alongside `video_frame`.

The main open question is codec and container: raw PCM is trivial to handle but large; compressed formats (Opus, AAC) need framing conventions. This is deferred until video is solid.

The frame allocator, relay, and archive modules should not assume that a frame implies video — they operate on opaque byte payloads with a message type and length, so audio frames will pass through the same infrastructure unchanged.

---

## Device Resilience

Nodes that read from hardware devices (V4L2 cameras, media devices) must handle transient device loss — a USB camera that disconnects and reconnects, a device node that briefly disappears during a mode switch, or a stream that errors out and can be retried. This is not an early implementation concern but has structural implications that should be respected from the start.

### The Problem by Layer

**Source node / device reader**

A device is opened by fd.
On a transient disconnect, the fd becomes invalid — reads return errors or short counts. The device may reappear under the same path after some time. Recovery requires closing the bad fd, waiting or polling for the device to reappear, reopening, and restarting the capture loop. Any state tied to the old fd (ioctl configuration, stream-on status) must be re-established. **Opaque stream edge** The downstream receiver sees bytes stop. There is no mechanism in an opaque stream to distinguish "slow source", "dead source", or "recovered source". A reconnection produces a new byte stream that appears continuous to the receiver — but contains a hard discontinuity. The receiver has no way to know it should reset state. This is a known limitation of opaque mode. If the downstream consumer is sensitive to stream discontinuities (e.g. a frame parser), it must use encapsulated mode on that edge. **Encapsulated stream edge** The source node sends a `stream_event` message (`0x0004`) on the affected `channel_id` before the bytes stop (if possible) or as the first message when stream resumes. The payload carries an event code: | Code | Meaning | |---|---| | `0x01` | Stream interrupted — device lost, bytes will stop | | `0x02` | Stream resumed — device recovered, frames will follow | On receiving `stream_interrupted`, downstream nodes know to discard any partial frame being assembled and reset parser state. On `stream_resumed`, they know a clean frame boundary follows and can restart cleanly. **Ingest module (MJPEG parser)** The two-pass EOI state machine is stateful per stream. It must expose an explicit reset operation that discards any partial frame in progress and returns the parser to a clean initial state. This reset is triggered by a `stream_interrupted` event, or by any read error from the device. Any frame allocation begun for the discarded partial frame must be released before the reset completes. 
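The reset obligation above can be sketched as a minimal per-stream parser holder. This is a hedged illustration only — `mjpeg_parser`, its fields, and `mjpeg_parser_reset` are hypothetical names, and plain `free` stands in for the frame allocator's release:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch of per-stream MJPEG parser state with an explicit
 * reset. Names and fields are illustrative, not the real ingest API. */

enum parser_state { PS_SEEK_SOI, PS_IN_FRAME };

struct mjpeg_parser {
    enum parser_state state;
    unsigned char *partial;   /* frame allocation in progress, or NULL */
    size_t partial_len;
};

/* Invoked on a stream_interrupted event (0x01) or any device read error:
 * release the partial frame and return the parser to its initial state. */
static void mjpeg_parser_reset(struct mjpeg_parser *p)
{
    free(p->partial);         /* stand-in for the allocator's release */
    p->partial = NULL;
    p->partial_len = 0;
    p->state = PS_SEEK_SOI;
}
```

On `stream_resumed`, the same parser instance can simply continue from its initial state, since a clean frame boundary is guaranteed to follow.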
**Frame allocator** A partial frame that was being assembled when the device dropped must be explicitly abandoned. The allocator must support an `abandon` operation distinct from a normal `release` — abandon means the allocation is invalid and any reference tracking for it should be unwound immediately. This prevents a partial allocation from sitting in the accounting tables and consuming budget. ### Source Node Recovery Loop The general structure for a resilient device reader (not yet implemented, for design awareness): 1. Open device, configure, start capture 2. On read error: emit `stream_interrupted` on the transport, close fd, enter retry loop 3. Poll for device reappearance (inotify on `/dev`, or timed retry) 4. On device back: reopen, reconfigure (ioctl state is lost), emit `stream_resumed`, resume capture 5. Log reconnection events to the control plane as observable signals The retry loop must be bounded — a device that never returns should eventually cause the node to report a permanent failure rather than loop indefinitely. ### Implications for Opaque Streams If a source node is producing an opaque stream and the device drops, the TCP connection itself may remain open while bytes stop flowing. The downstream node only learns something is wrong via a timeout or its own read error. For this reason, **opaque streams should only be used on edges where the downstream consumer either does not care about discontinuities or has its own out-of-band mechanism to detect them**. Edges into an ingest node must use encapsulated mode. --- ## Implementation Approach The system is built module by module in C11. Each translation unit is developed and validated independently before being integrated. See [planning.md](planning.md) for current status and module order, and [conventions.md](conventions.md) for code and project conventions. The final deliverable is a single configurable node binary. 
During development, each module is exercised through small driver programs that live in the development tree, not in the module directories. --- ## Protocol Serialization Control message payloads use a compact binary format. The wire encoding is **little-endian** throughout — all target platforms (Raspberry Pi ARM, x86 laptop) are little-endian, and little-endian is the convention of most modern protocols (USB, Bluetooth LE, etc.). ### Serialization Layer A `serial` module provides the primitive read/write operations on byte buffers: - `put_u8`, `put_u16`, `put_u32`, `put_i32`, `put_u64` — write a value at a position in a buffer - `get_u8`, `get_u16`, `get_u32`, `get_i32`, `get_u64` — read a value from a position in a buffer These are pure buffer operations with no I/O. Fields are never written by casting a struct to bytes — each field is placed explicitly, which eliminates struct padding and alignment assumptions. ### Protocol Layer A `protocol` module builds on `serial` and the transport to provide typed message functions: ```c write_v4l2_set_control(stream, id, value); write_v4l2_get_control(stream, id); write_v4l2_enumerate_controls(stream); ``` Each `write_*` function knows the exact wire layout of its message, packs the full frame (header + payload) into a stack buffer using `put_*`, then issues a single write to the stream. The corresponding `read_*` functions unpack responses using `get_*`. This gives a clean two-layer separation: `serial` handles byte layout, `protocol` handles message semantics and I/O. ### Web Interface as a Protocol Peer The web interface (Node.js/Express) participates in the graph as a first-class protocol peer — it speaks the same binary protocol as any C node. There is no JSON bridge or special C code to serve the web layer. 
The boundary is: - **Socket side**: binary protocol, framed messages, little-endian fields read with `DataView` (`dataView.getUint32(offset, true)` maps directly to `get_u32`) - **Browser side**: HTTP/WebSocket, JSON, standard web APIs A `protocol.mjs` module in the web layer mirrors the C `protocol` module — same message types, same wire layout, different language. This lets the web interface connect to any video node, send control requests (V4L2 enumeration, parameter get/set, device discovery), and receive structured responses. Treating the web node as a peer also means it exercises the real protocol, which surfaces bugs that a JSON bridge would hide. ### Future: Single Source of Truth via Preprocessor The C `protocol` module and the JavaScript `protocol.mjs` currently encode the same wire format in two languages. This duplication is a drift risk — a change to a message layout must be applied in both places. A future preprocessor will eliminate this. Protocol messages will be defined once in a language-agnostic schema, and the preprocessor will emit both: - C source — `put_*`/`get_*` calls, struct definitions, `write_*`/`read_*` functions - ESM JavaScript — `DataView`-based encode/decode, typed constants The preprocessor is the same tool planned for generating error location codes (see `common/error`). The protocol schema becomes a single source of truth, and both the C and JavaScript implementations are derived artifacts. --- ## Node Discovery Standard mDNS (RFC 6762) uses UDP multicast over `224.0.0.251:5353` with DNS-SD service records. The wire protocol is well-defined and the multicast group is already in active use on most LANs. The standard service discovery stack (Avahi, Bonjour, `nss-mdns`) provides that transport but brings significant overhead: persistent daemons, D-Bus dependencies, complex configuration surface, and substantial resident memory. None of that is needed here. The approach: **reuse the multicast transport, define our own wire format**. 
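Concretely, packing an announcement payload with `serial`-style little-endian primitives might look like the following sketch. `pack_announcement` is a hypothetical name, and the local `put_*` helpers merely stand in for the real `serial` module:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins for the serial module's little-endian primitives. */
static void put_u8(uint8_t *buf, size_t pos, uint8_t v)  { buf[pos] = v; }
static void put_u16(uint8_t *buf, size_t pos, uint16_t v)
{
    buf[pos]     = (uint8_t)(v & 0xff);   /* low byte first: little-endian */
    buf[pos + 1] = (uint8_t)(v >> 8);
}

/* Hypothetical sketch: pack an announcement payload field by field —
 * version, site_id, tcp_port, function_flags, name_len, name.
 * Returns the payload length. The exact layout is illustrative, not
 * the frozen wire format. */
static size_t pack_announcement(uint8_t *buf, uint8_t version,
                                uint16_t site_id, uint16_t tcp_port,
                                uint16_t flags, const char *name)
{
    size_t pos = 0;
    size_t name_len = strlen(name);
    put_u8(buf, pos, version);            pos += 1;
    put_u16(buf, pos, site_id);           pos += 2;
    put_u16(buf, pos, tcp_port);          pos += 2;
    put_u16(buf, pos, flags);             pos += 2;
    put_u8(buf, pos, (uint8_t)name_len);  pos += 1;
    memcpy(buf + pos, name, name_len);    pos += name_len;
    return pos;
}
```

Each field is placed explicitly — never by casting a struct to bytes — so there are no padding or alignment assumptions in the wire image.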
Rather than DNS wire format, node announcements are encoded as binary frames using the same serialization layer (`serial`) and frame header used for video transport. A node joins the multicast group, broadcasts periodic announcements, and listens for announcements from peers. ### Announcement Frame | Field | Size | Purpose | |---|---|---| | `message_type` | 2 bytes | Discovery message type (e.g. `0x0010` for node announcement) | | `channel_id` | 2 bytes | Reserved / zero | | `payload_length` | 4 bytes | Byte length of payload | | Payload | variable | Encoded node identity and capabilities | Payload fields: | Field | Type | Purpose | |---|---|---| | `protocol_version` | u8 | Wire format version | | `site_id` | u16 | Site this node belongs to (`0` = local / unassigned) | | `tcp_port` | u16 | Port where this node accepts transport connections | | `function_flags` | u16 | Bitfield declaring node capabilities (see below) | | `name_len` | u8 | Length of name string | | `name` | bytes | Node name (`namespace:instance`, e.g. `v4l2:microscope`) | `function_flags` bits: | Bit | Mask | Meaning | |---|---|---| | 0 | `0x0001` | Source — produces video | | 1 | `0x0002` | Relay — receives and distributes streams | | 2 | `0x0004` | Sink — consumes video (display, archiver, etc.) | | 3 | `0x0008` | Controller — participates in control plane coordination | A node may set multiple bits — a relay that also archives sets both `RELAY` and `SINK`. ### Behaviour - Nodes send announcements periodically (e.g. 
every 5 s) and immediately on startup
- No daemon — the node process itself sends and listens; no background service required
- On receiving an announcement, the control plane records the peer (address, port, name, function flags) and can initiate a transport connection if needed
- A node going silent for a configured number of announcement intervals is considered offline
- Announcements are informational only — the receiving node validates identity at connection time

### No Avahi/Bonjour Dependency

The system does not link against, depend on, or interact with Avahi or Bonjour. It opens a raw UDP multicast socket directly, which requires only standard POSIX socket APIs. This keeps the runtime dependency footprint minimal and the behaviour predictable.

---

## Multi-Site (Forward Compatibility)

The immediate use case is a single LAN. A planned future use case is **site-to-site linking** — two independent networks (e.g. a lab and a remote location) connected by a tunnel (SSH port-forward, WireGuard, etc.), where nodes on both sites are reachable from either side.

### Site Identity

Every node carries a `site_id` (`u16`) in its announcement. In a single-site deployment this is always `0`. When sites are joined, each site is assigned a distinct non-zero ID; nodes retain their names across the join and are fully addressable by `(site_id, name)` from anywhere in the combined network. This field is reserved from day one so that multi-site never requires a wire format change or a rename of existing identifiers.

### Site Gateway Node

A site gateway is a node that participates in both networks simultaneously — it has a connection on the local transport and a connection over the inter-site tunnel. It:

- Bridges discovery announcements between sites (rewriting `site_id` appropriately)
- Forwards encapsulated transport frames across the tunnel on behalf of cross-site edges
- Is itself a named node, so the control plane can see and reason about it

The tunnel transport is out of scope for now.
The gateway is a node type, not a special infrastructure component — it uses the same wire protocol as everything else. ### Site ID Translation Both sides of a site-to-site link will independently default to `site_id = 0`. A gateway cannot simply forward announcements across the boundary — every node on both sides would appear as site 0 and be indistinguishable. The gateway is responsible for **site ID translation**: it assigns a distinct non-zero `site_id` to each side of the link and rewrites the `site_id` field in all announcements and any protocol messages that carry a `site_id` as they cross the boundary. From each side's perspective, remote nodes appear with the translated ID assigned by the gateway; local nodes retain their own IDs. This means `site_id = 0` should be treated as "local / unassigned" and never forwarded across a site boundary without translation. A node that receives an announcement with `site_id = 0` on a cross-site link should treat it as a protocol error from the gateway. ### Addressing A fully-qualified node address is `site_id:namespace:instance`. Within a single site, `site_id` is implicit and can be omitted. The control plane and discovery layer must store `site_id` alongside every peer record from the start, even if it is always `0`, so that the upgrade to multi-site addressing requires only configuration and a gateway node — not code changes. --- ## Node State Model ### Wanted vs Current State Each node maintains two independent views of its configuration: **Wanted state** — the declared intent for this node. Set by the controller via protocol commands and persisted independently of whether the underlying resources are actually running. Examples: "ingest /dev/video0 as stream 3, send to 192.168.1.2:8001", "display stream 3 in a window". Wanted state survives connection drops, device loss, and restarts — it represents what the node *should* be doing. **Current state** — what the node is actually doing right now. 
Derived from which file descriptors are open, which transport connections are established, which processes are running. Changes as resources are acquired or released. The controller queries both views to construct the graph. Wanted state gives the topology (what is configured). Current state gives the runtime overlay (what is live, with stats). This separation means the web UI can show an edge as grey (configured but not connected), green (connected and streaming), or red (configured but failed) without any special-casing — the difference is just whether wanted and current state agree. ### Reconciler A generic reconciler closes the gap between wanted and current state. It is invoked: - **On event** — transport disconnect, device error, process exit, `STREAM_OPEN` received: fast response to state changes - **On periodic tick** — safety net; catches external failures that produce no callback (e.g. a device that silently disappears and reappears) The reconciler does not know what a "stream" or a "V4L2 device" is. It operates on abstract state machines, each representing one resource. Resources declare their states, transitions, and dependencies; the reconciler finds the path from current to wanted state and executes the transitions in order. ### Resource State Machines Each managed resource is described as a directed graph: - **Nodes** are states (e.g. `CLOSED`, `OPEN`, `STREAMING`) - **Edges** are transitions with associated actions (e.g. `open_fd`, `close_fd`, `connect_transport`, `spawn_process`) - **Dependencies** between resources constrain ordering (e.g. transport connection requires device open) The state graphs are small and defined at compile time. Pathfinding is BFS — with 3–5 states per resource the cost is negligible. The benefit is that adding a new resource type (e.g. an ffmpeg subprocess for codec work) requires only defining its state graph and declaring its dependencies; the reconciler's core logic is unchanged. 
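The pathfinding step is small enough to sketch in full. This is an illustrative BFS over an adjacency matrix of transitions — `plan_path` and `MAX_STATES` are hypothetical names, and the dependency checks the reconciler performs before executing each edge are omitted:

```c
#include <string.h>

#define MAX_STATES 8

/* Hypothetical sketch: BFS over a resource's transition graph.
 * adj[from][to] is nonzero if a transition exists. Writes the state
 * sequence from `cur` to `want` into path[] and returns its length,
 * or -1 if the wanted state is unreachable. */
static int plan_path(const int adj[MAX_STATES][MAX_STATES], int n,
                     int cur, int want, int path[MAX_STATES])
{
    int prev[MAX_STATES], queue[MAX_STATES], head = 0, tail = 0;
    memset(prev, -1, sizeof prev);        /* -1 = not yet visited */
    prev[cur] = cur;
    queue[tail++] = cur;
    while (head < tail) {
        int s = queue[head++];
        if (s == want) break;
        for (int t = 0; t < n; t++)
            if (adj[s][t] && prev[t] == -1) {
                prev[t] = s;
                queue[tail++] = t;
            }
    }
    if (prev[want] == -1) return -1;

    /* Walk back from want to cur, then reverse into path[]. */
    int rev[MAX_STATES], len = 0;
    for (int s = want; s != cur; s = prev[s]) rev[len++] = s;
    rev[len++] = cur;
    for (int i = 0; i < len; i++) path[i] = rev[len - 1 - i];
    return len;
}
```

For a three-state graph like the V4L2 capture device, planning from `CLOSED` to `STREAMING` returns the three states in order; the reconciler then executes the action attached to each edge, pausing wherever a dependency is not yet satisfied.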
**Example resource state graphs:** V4L2 capture device: ``` CLOSED → OPEN → STREAMING ``` Transitions: `open_fd` (CLOSED→OPEN), `start_capture` (OPEN→STREAMING), `stop_capture` (STREAMING→OPEN), `close_fd` (OPEN→CLOSED). Outbound transport connection: ``` DISCONNECTED → CONNECTING → CONNECTED ``` Transitions: `connect` (DISCONNECTED→CONNECTING), `connected_cb` (CONNECTING→CONNECTED), `disconnect` (CONNECTED→DISCONNECTED). External codec process: ``` STOPPED → STARTING → RUNNING ``` Transitions: `spawn` (STOPPED→STARTING), `ready_cb` (STARTING→RUNNING), `kill` (RUNNING→STOPPED). Dependency example: "outbound transport connection" requires "V4L2 device open". The reconciler will not attempt to connect the transport until the device is in state `OPEN` or `STREAMING`. ### Generic Implementation The reconciler is implemented as a standalone module (`reconciler`) that is not specific to video. It operates on: ```c typedef struct { int state_count; int current_state; int wanted_state; /* transition table: [from][to] → action fn + dependency list */ } Resource; ``` This makes it reusable across any node component in the project — not just video ingest. The video node registers its resources (device, transport connection, display sink) and their dependencies, then calls `reconciler_tick()` on events and periodically. ### Node State Queries Two protocol commands allow the controller to query a node's state: **`GET_CONFIG_STATE`** — returns the wanted state: which streams the node is configured to produce or consume, their destinations/sources, format, stream ID. This is the topology view — what is configured regardless of whether it is currently active. **`GET_RUNTIME_STATE`** — returns the current state: which resources are in which state, live fps/mbps per stream (from `stream_stats`), error codes for any failed resources. The controller queries all discovered nodes, correlates streams by ID and peer address, and reconstructs the full graph from the union of responses. 
No central authority is needed — the graph emerges from the node state reports. ### Stream ID Assignment Stream IDs are assigned by the controller, not by individual nodes. This ensures that when node A reports "I am sending stream 3 to B" and node B reports "I am receiving stream 3 from A", the IDs match and the edge can be reconstructed. Each `START_INGEST` or `START_SINK` command from the controller includes the stream ID to use. ### Connection Direction The source node connects outbound to the sink's transport server port. A single TCP port per node is the default — all traffic (video frames, control messages, state queries) flows through it in both directions. Dedicated per-stream ports on separate listening sockets are a future option for high-bandwidth links and must be represented in state reporting so the graph reconstructs correctly regardless of which port a connection uses. --- ## Open Questions - What is the graph's representation format — in-memory object graph, serialized config, or both? - How are connections established — does the controller push connection instructions to nodes, or do nodes pull from a known address? - Drop policy for completeness queues: drop oldest (recency) or drop newest (continuity)? Should be per-output configurable. - When a relay has multiple inputs on an encapsulated transport, how are streams tagged on the outbound side — same stream_id passthrough, or remapped? - What transport is used for relay edges — TCP, UDP, shared memory for local hops? - Should per-output byte budgets be hard limits or soft limits with hysteresis? - Which relay scheduler policies should be implemented first, and what is the right interface for plugging in a custom policy? Minimum viable set is probably strict priority + weighted round-robin.