# Video Routing System — Architecture
## Concept
A graph-based multi-peer video routing system where nodes are media processes and edges are transport connections. The graph carries video streams between sources, relay nodes, and sinks, with **priority** as a first-class property on paths — so that a low-latency monitoring feed and a high-quality archival feed can coexist and be treated differently by the system.
---
## Documentation
- [Transport Protocol and Serialization](docs/transport.md) — wire modes, frame header, message types, protocol serialization layer, web peer
- [Relay Design](docs/relay.md) — delivery modes, memory model, congestion handling, multi-input scheduling
- [Codec Module](docs/codec.md) — stream metadata fields, format negotiation, codec backends
- [X11 / Xorg Integration](docs/xorg.md) — screen geometry, screen grab source, frame viewer sink, renderers
- [Node Discovery and Multi-Site](docs/discovery.md) — multicast announcement format, behaviour, site identity and gateways
- [Node State Model](docs/node-state.md) — wanted vs current state, reconciler, resource state machines, stream stats, state queries, stream ID assignment, connection direction
- [Device Resilience](docs/device-resilience.md) — transient device loss handling, stream events, recovery loop, audio (future)
- [Protocol Reference](docs/protocol.md) — full message payload schemas
---
## Design Rationale
### Declarative, Not Imperative
The control model is **declarative**: the controller sets *wanted state* on each node ("you should be ingesting /dev/video0 and sending stream 3 to node B"), and each node is responsible for reconciling its current state toward that goal autonomously. The controller does not issue step-by-step commands like "open device", "connect to peer", "start sending".
This is a deliberate architectural decision. Imperative orchestration — where the controller drives each resource transition directly — is fragile: the controller must track the state of every remote resource, handle every failure sequence, and re-issue commands when things go wrong. Declarative orchestration pushes that responsibility to the node, which is the only place with direct access to its own resources and the ability to respond to local failures (device disconnect, transport drop, process crash) without round-tripping through the controller.
The practical effect: the controller writes wanted state and the node's reconciler does the rest. The controller can query both wanted and current state at any time to understand the topology and health of the network — see [Node State Model](docs/node-state.md).
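The reconciliation loop can be sketched in a few lines of C. This is a minimal illustration of the idea only — the names (`node_state`, `reconcile_step`), the single-resource state machine, and the string return values are all hypothetical; the real resource state machines are described in [Node State Model](docs/node-state.md).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical resource states -- a stand-in for the real state
 * machines in docs/node-state.md. */
typedef enum { RES_ABSENT, RES_OPENING, RES_RUNNING } res_state;

typedef struct {
    bool      device_wanted;   /* wanted state, written by the controller */
    res_state device_current;  /* current state, owned by the node */
} node_state;

/* One reconciler tick: compare wanted and current state and take the
 * next local action.  Returns a short label describing what it did so
 * the loop is observable; a real node would drive ioctls and sockets
 * here instead. */
static const char *reconcile_step(node_state *s)
{
    if (s->device_wanted && s->device_current == RES_ABSENT) {
        s->device_current = RES_OPENING;   /* e.g. open the device */
        return "opening";
    }
    if (s->device_wanted && s->device_current == RES_OPENING) {
        s->device_current = RES_RUNNING;   /* device came up */
        return "running";
    }
    if (!s->device_wanted && s->device_current != RES_ABSENT) {
        s->device_current = RES_ABSENT;    /* tear down locally */
        return "closed";
    }
    return "converged";                    /* nothing to do */
}
```

The controller only ever writes `device_wanted`; every transition of `device_current` is the node's own decision, which is what lets it recover from local failures without a controller round-trip.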
### Get It on the Wire First
A key principle driving the architecture is that **capture devices should not be burdened with processing**.
A Raspberry Pi attached to a camera (V4L2 source) is capable of pulling raw or MJPEG frames off the device, but it is likely too resource-constrained to also transcode, mux, or perform any non-trivial stream manipulation. Doing so would add latency and compete with the capture process itself.
The preferred model is:
1. **Pi captures and transmits raw** — reads frames directly from V4L2 (MJPEG or raw Bayer/YUV) and puts them on the wire over TCP as fast as possible, with no local transcoding
2. **A more capable machine receives and defines the stream** — a downstream node with proper CPU/GPU resources receives the raw feed and produces well-formed, containerized, or re-encoded output appropriate for the intended consumers (display, archive, relay)
This separation means the Pi's job is purely ingestion and forwarding. It keeps the capture loop tight and latency minimal. The downstream node then becomes the "source" of record for the rest of the graph.
This is also why the V4L2 remote control protocol is useful — the Pi doesn't need to run any control logic locally. It exposes its camera parameters over TCP, and the controlling machine adjusts exposure, white balance, codec settings, etc. remotely. The Pi just acts on the commands.
---
## Graph Model
### Nodes
Each node is a named process instance, identified by a namespace and name (e.g. `v4l2:microscope`, `xorg:preview`, `archiver:main`).
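A minimal sketch of parsing that identifier convention, assuming a simple `namespace:name` split — the function name, field sizes, and error convention are illustrative, not taken from the project's actual headers:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative node identifier: "namespace:name", e.g. "v4l2:microscope".
 * Field sizes are arbitrary for the sketch. */
typedef struct {
    char ns[32];
    char name[64];
} node_id;

/* Returns 0 on success, -1 if the identifier is malformed or a
 * component does not fit. */
static int node_id_parse(const char *s, node_id *out)
{
    const char *colon = strchr(s, ':');
    if (!colon || colon == s || colon[1] == '\0')
        return -1;                                /* missing part */
    size_t ns_len = (size_t)(colon - s);
    if (ns_len >= sizeof out->ns || strlen(colon + 1) >= sizeof out->name)
        return -1;                                /* component too long */
    memcpy(out->ns, s, ns_len);
    out->ns[ns_len] = '\0';
    strcpy(out->name, colon + 1);
    return 0;
}
```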
Node types:
| Type | Role |
|---|---|
| **Source** | Produces video — V4L2 camera, screen grab, file, test signal |
| **Relay** | Receives one or more input streams and distributes to one or more outputs, each with its own delivery mode and buffer; never blocks upstream |
| **Sink** | Consumes video — display window, archiver, encoder output |
A relay with multiple inputs is what would traditionally be called a mux — it combines streams from several sources and forwards them, possibly over a single transport. The dispatch and buffering logic is the same regardless of input count.
### Edges
An edge is a transport connection between two nodes. Edges carry:
- The video stream itself (TCP, pipe, or other transport)
- A **priority** value
- A **transport mode** — opaque or encapsulated (see [Transport Protocol](docs/transport.md))
### Priority
Priority governs how the system allocates resources and makes trade-offs when paths compete:
- **High priority (low latency)** — frames are forwarded immediately; buffering is minimized; a slow downstream node gets dropped frames, not delayed ones; quality may be lower
- **Low priority (archival)** — frames may be buffered; quality should be maximized; latency is acceptable; dropped frames are undesirable
Priority is a property of the *path*, not of the source. The same source can feed a high-priority monitoring path and a low-priority archival path simultaneously.
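The edge and priority properties above can be sketched as a descriptor struct. This is an illustration only — the field names, widths, and the `edge_preferred` helper are hypothetical, not the project's actual types:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative transport modes, per docs/transport.md. */
typedef enum { TRANSPORT_OPAQUE, TRANSPORT_ENCAPSULATED } transport_mode;

/* Sketch of an edge descriptor: the stream it carries, its priority,
 * its transport mode, and the underlying connection. */
typedef struct {
    uint16_t       stream_id;  /* stream carried on this edge */
    uint8_t        priority;   /* higher value = served first */
    transport_mode mode;       /* opaque or encapsulated */
    int            fd;         /* TCP socket or pipe end */
} edge;

/* Priority belongs to the path, not the source: the same stream_id can
 * appear on two edges with different priorities, and contention is
 * resolved per edge. */
static const edge *edge_preferred(const edge *a, const edge *b)
{
    return a->priority >= b->priority ? a : b;
}
```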
---
## Control Plane
There is no central hub or broker. Nodes communicate directly with each other over the binary transport. Any node can hold the **controller role** (`function_flags` bit 3) — this means it has a user-facing interface (such as the web UI) through which the user can inspect the network, load a topology configuration, and establish or tear down connections between nodes.
The controller role is a capability, not a singleton. Multiple nodes could hold it simultaneously; which one a user interacts with is a matter of which they connect to. A node that is purely a source or relay with no UI holds no controller bits.
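As a capability check this is a single bit test. A minimal sketch, assuming bit numbering from 0 for "bit 3" — the other flag names shown are placeholders, not the real `function_flags` layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    FLAG_SOURCE     = 1u << 0,  /* placeholder */
    FLAG_RELAY      = 1u << 1,  /* placeholder */
    FLAG_SINK       = 1u << 2,  /* placeholder */
    FLAG_CONTROLLER = 1u << 3,  /* controller role, per this document */
};

/* A capability, not a singleton: any number of peers may set the bit,
 * and a pure source or relay simply leaves it clear. */
static bool has_controller_role(uint32_t function_flags)
{
    return (function_flags & FLAG_CONTROLLER) != 0;
}
```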
The practical flow is: a user starts a node with the controller role, discovers the other nodes on the network via the multicast announcement layer, and uses the interface to configure how streams are routed between them. The controller writes wanted state to the relevant peers over the binary protocol — each peer then reconciles its own resources autonomously. There is no intermediary and no imperative step-by-step orchestration.
The first controller interface is a CLI tool (`controller_cli`), which exercises the same protocol that the eventual web UI will use. The web UI is a later addition — the protocol and node behaviour are identical either way.
V4L2 device control and enumeration are carried as control messages within the encapsulated transport on the same connection as video — see [Transport Protocol](docs/transport.md).
---
## Ingestion Pipeline (Raspberry Pi Example)
```mermaid
graph LR
    CAM["V4L2 Camera<br/>/dev/video0"] -->|raw MJPEG| PI["Pi: ingest node"]
    PI -->|encapsulated stream| RELAY[Relay]
    RELAY -->|high priority| DISPLAY["Display / Preview<br/>low latency"]
    RELAY -->|low priority| ARCHIVE["Archiver<br/>high quality"]
    CTRL["Controller node<br/>CLI or web UI"] -.->|V4L2 control via transport| PI
    CTRL -.->|wanted state| RELAY
```
The Pi runs a node process that dequeues V4L2 buffers and forwards each buffer as an encapsulated frame over TCP. It also exposes the V4L2 control endpoint for remote parameter adjustment.
Everything else happens on machines with adequate resources.
### V4L2 Buffer Dequeuing
When a V4L2 device is configured for `V4L2_PIX_FMT_MJPEG`, the driver delivers one complete MJPEG frame per dequeued buffer — frame boundaries are guaranteed at the source. The ingest module dequeues these buffers and emits each one as an encapsulated frame directly into the transport. No scanning or frame boundary detection is needed.
This is the primary capture path. It is clean, well-defined, and relies on standard V4L2 kernel behaviour rather than heuristics.
### Misbehaving Hardware: `mjpeg_scan` (Future)
Some hardware does not honour the per-buffer framing contract — cheap USB webcams or cameras with unusual firmware may concatenate multiple partial frames into a single buffer, or split one frame across multiple buffers. For these cases a separate optional `mjpeg_scan` module provides a fallback: it scans the incoming byte stream for JPEG SOI (`0xFF 0xD8`) and EOI (`0xFF 0xD9`) markers to recover frame boundaries heuristically.
This module is explicitly a workaround for non-compliant hardware. It is not part of the primary pipeline and will be implemented only if a specific device requires it. For sources with unusual container formats (AVI-wrapped MJPEG, HTTP multipart, RTSP with quirky packetisation), the preferred approach is to route through ffmpeg rather than write a custom parser.
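The core of such a scanner is small. A sketch under stated assumptions — the module does not exist yet, so the function name and signature are illustrative; note also that a plain marker scan can be fooled by `0xFF 0xD9` byte pairs inside entropy-coded data, which is part of why this stays a fallback:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Heuristic frame recovery: find the first complete JPEG frame
 * (SOI 0xFF 0xD8 ... EOI 0xFF 0xD9) in a byte stream.  Returns 0 and
 * sets [*start, *end) on success; -1 if no complete frame is present
 * yet (caller should accumulate more bytes). */
static int mjpeg_scan_frame(const uint8_t *buf, size_t len,
                            size_t *start, size_t *end)
{
    size_t soi = len;
    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == 0xFF && buf[i + 1] == 0xD8) { soi = i; break; }
    }
    if (soi == len)
        return -1;                       /* no SOI found */
    for (size_t i = soi + 2; i + 1 < len; i++) {
        if (buf[i] == 0xFF && buf[i + 1] == 0xD9) {
            *start = soi;
            *end   = i + 2;              /* one past EOI */
            return 0;
        }
    }
    return -1;                           /* frame incomplete */
}
```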
---
## Implementation Approach
The system is built module by module in C11. Each translation unit is developed and validated independently before being integrated. See [planning.md](planning.md) for current status and module order, and [conventions.md](conventions.md) for code and project conventions.
The final deliverable is a single configurable node binary. During development, each module is exercised through small driver programs that live in the development tree, not in the module directories.
---
## Decided
These were previously open questions and are now resolved:
- **Connection direction**: the source node connects outbound to the sink's transport server. The controller writes wanted state to the source node including the destination host:port; the source's reconciler establishes the connection.
- **Stream ID assignment**: stream IDs are assigned by the controller, not generated locally by nodes. This ensures both ends of a stream report the same ID and the graph can be reconstructed by correlating node state reports.
- **Single port per node**: one TCP listening port handles all traffic — video frames, control messages, state queries — in both directions. Dedicated per-stream ports on separate sockets are a future option but not the default.
- **First delivery mode**: low-latency (no-buffer) mode is implemented first. No frame queue anywhere in the pipeline — V4L2 dequeue goes directly to transport send; received frames render immediately and are dropped if the display is behind.
- **Drop policy**: per-output configurable. Both drop-oldest (recency) and drop-newest (continuity) are supported; the policy is set at stream open time.
- **Stream ID remapping at relay**: no remapping — stream IDs pass through unchanged. The relay forwards frames with the same stream ID they arrived with. Site-to-site gateways may need to translate IDs at the boundary but that is a future concern handled at the gateway, not in the relay itself.
- **Transport for relay edges**: TCP only for now. UDP and shared memory (for local hops) are future considerations; the transport abstraction should accommodate them without the relay needing to care which is in use.
- **Byte budgets**: soft limits with hysteresis — two thresholds (start dropping, stop dropping) to avoid thrashing at the boundary.
- **Relay scheduler**: strict priority first. Additional policies (round-robin, weighted round-robin, deficit round-robin, source suppression) are documented in [Relay Design](docs/relay.md) and will be added later. The scheduler interface is pluggable so policies are interchangeable without touching routing logic.
- **Graph representation**: the graph lives in the web interface (ESM). No special format needed — plain objects, classes, and arrays. The web node queries all discovered peers for their wanted and runtime state, reconstructs the graph in-memory, and drives the UI from that. Future TUI/CLI controller tools reuse the same ESM libraries via Node.js. Complex graph logic (reconstruction, topology diffing, layout) is easier to maintain in ESM than in C and belongs there.
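The per-output drop policy decided above can be sketched as a small ring buffer. The capacity and names are illustrative, and frames are reduced to integer IDs:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DROP_OLDEST, DROP_NEWEST } drop_policy;

#define QCAP 4  /* illustrative capacity */

typedef struct {
    int         frames[QCAP];
    size_t      head, count;
    drop_policy policy;       /* fixed at stream open time */
} out_queue;

/* On a full queue, DROP_OLDEST evicts the oldest frame (recency wins);
 * DROP_NEWEST discards the incoming frame (continuity wins). */
static void queue_push(out_queue *q, int frame)
{
    if (q->count == QCAP) {
        if (q->policy == DROP_NEWEST)
            return;                       /* keep what we have */
        q->head = (q->head + 1) % QCAP;   /* evict oldest */
        q->count--;
    }
    q->frames[(q->head + q->count) % QCAP] = frame;
    q->count++;
}

static int queue_pop(out_queue *q)
{
    int frame = q->frames[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return frame;
}
```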
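The byte-budget hysteresis decided above reduces to two thresholds and one bit of state. The threshold values and names here are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Soft byte budget with hysteresis: start dropping above high_water,
 * stop only once back below low_water, so the output does not thrash
 * at a single boundary. */
typedef struct {
    size_t high_water;  /* start dropping above this many queued bytes */
    size_t low_water;   /* stop dropping once below this */
    bool   dropping;
} byte_budget;

/* Called with the current queue depth; returns true while frames
 * should be dropped for this output. */
static bool budget_should_drop(byte_budget *b, size_t queued_bytes)
{
    if (!b->dropping && queued_bytes > b->high_water)
        b->dropping = true;
    else if (b->dropping && queued_bytes < b->low_water)
        b->dropping = false;
    return b->dropping;
}
```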
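The strict-priority scheduler decided above can be sketched as a single pick function — one possible shape for the pluggable interface, with illustrative names:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int    priority;        /* higher value wins */
    size_t pending_frames;  /* frames waiting on this input */
} sched_input;

/* Strict priority: always serve the highest-priority input that has a
 * frame pending.  Returns the input index, or -1 if all inputs are
 * idle.  A pluggable scheduler would swap this function for a
 * round-robin or weighted variant without touching routing logic. */
static int sched_strict_pick(const sched_input *in, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (in[i].pending_frames == 0)
            continue;
        if (best < 0 || in[i].priority > in[best].priority)
            best = (int)i;
    }
    return best;
}
```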
---
## Open Questions
None currently open.