Add codec module: per-frame encode/decode for screen grabs

Documents codec identification (a u16 per channel, set at stream open) and four initial candidates: MJPEG/libjpeg-turbo, QOI, ZSTD-raw, VA-API H.264 intra. The screen grab source calls the codec before transport; relay and archive remain payload-agnostic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 22:45:18 +00:00
parent 5cea34caf5
commit 44a3326a76
2 changed files with 44 additions and 2 deletions
--- a/architecture.md
+++ b/architecture.md
@@ -242,6 +242,45 @@ graph TD

 ---

+## Codec Module
+
+A `codec` module provides per-frame encode and decode operations for pixel data. It sits between raw pixel buffers and the transport — sources call encode before sending, sinks call decode after receiving. The relay and transport layers never need to understand pixel formats; they carry opaque payloads.
+
+### Codec Identification
+
+Receivers must know what format a frame payload is in. This is communicated at stream setup time via a control message that associates a `channel_id` with a codec identifier, rather than tagging every frame header. The codec identifier is a `u16`:
+
+| Value | Codec |
+|---|---|
+| `0x0001` | MJPEG — same format as V4L2 hardware-encoded output; libjpeg-turbo on the encode side |
+| `0x0002` | QOI — lossless, single-header implementation, fast; good for screen content |
+| `0x0003` | Raw pixels + ZSTD — lossless; raw BGRA/RGBA compressed with ZSTD at a low compression level (e.g. level 1) |
+| `0x0004` | H.264 intra — single I-frames via VA-API hardware encode; high compression, GPU required |
+
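A minimal sketch of how the identifier table could be represented and serialized. The numeric values come from the table above; the names `CodecId`, `pack_codec_id`, and `parse_codec_id` are hypothetical, and big-endian byte order is an assumption, since the document does not specify the wire encoding.

```python
import struct
from enum import IntEnum


class CodecId(IntEnum):
    """Codec identifiers from the table above (names are illustrative)."""
    MJPEG = 0x0001
    QOI = 0x0002
    RAW_ZSTD = 0x0003
    H264_INTRA = 0x0004


def pack_codec_id(codec: CodecId) -> bytes:
    # Assumption: the u16 is sent big-endian (network order).
    return struct.pack(">H", codec)


def parse_codec_id(data: bytes) -> CodecId:
    # Raises ValueError for values outside the table, so a receiver
    # can treat an unknown identifier as "no decoder available".
    (value,) = struct.unpack(">H", data[:2])
    return CodecId(value)
```

Treating an unknown value as an error at parse time keeps the "reject what you cannot decode" decision in one place on the receiver.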
+V4L2 camera streams typically arrive pre-encoded as MJPEG from hardware; no encode step is needed on that path. The codec module is primarily used by the screen grab source.
+
+### Format Negotiation
+
+When a source node opens a stream channel it sends a `stream_open` control message that includes the codec identifier. The receiver can reject the codec if it has no decoder for it. This keeps codec knowledge at the edges — relay nodes are unaffected.
+
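The negotiation described above could look roughly like the following. This is a sketch under stated assumptions: the `stream_open` payload layout (a u16 `channel_id` followed by the u16 codec identifier, big-endian), the `StreamOpen` and `Receiver` names, and the boolean accept/reject result are all hypothetical; the document only says the message carries the codec identifier and that the receiver may reject it.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout: channel_id (u16), codec_id (u16), big-endian.
STREAM_OPEN_FORMAT = ">HH"


@dataclass(frozen=True)
class StreamOpen:
    channel_id: int
    codec_id: int

    def pack(self) -> bytes:
        return struct.pack(STREAM_OPEN_FORMAT, self.channel_id, self.codec_id)

    @classmethod
    def unpack(cls, payload: bytes) -> "StreamOpen":
        return cls(*struct.unpack(STREAM_OPEN_FORMAT, payload))


class Receiver:
    """Edge node that binds a decoder to each channel at stream open."""

    def __init__(self, decoders):
        self.decoders = dict(decoders)  # codec_id -> decode callable
        self.channels = {}              # channel_id -> decode callable

    def handle_stream_open(self, payload: bytes) -> bool:
        msg = StreamOpen.unpack(payload)
        decoder = self.decoders.get(msg.codec_id)
        if decoder is None:
            return False  # no decoder for this codec: reject the stream
        self.channels[msg.channel_id] = decoder
        return True

    def handle_frame(self, channel_id: int, payload: bytes):
        # Frames carry no codec tag; the channel binding decides.
        return self.channels[channel_id](payload)
```

A relay between source and receiver would forward both the control message and the frame payloads as opaque bytes, which is why it needs no entry in any decoder table.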
+### libjpeg-turbo
+
+JPEG is the natural first codec for three reasons. libjpeg-turbo provides SIMD-accelerated encode on both x86 and ARM. The output is the same per-frame JPEG format that V4L2 cameras already produce, so the ingest and archive paths can treat both sources the same way. And JPEG is decodable almost everywhere, including in browsers via `<img>` or `createImageBitmap`. The codec is lossy, but quality is configurable.
+
+### QOI
+
+QOI (Quite OK Image Format) is a strong candidate for lossless screen grabs: it encodes and decodes in a single pass with no external dependencies, performs well on content with large uniform regions (UIs, text, diagrams), and the reference implementation is a single `.h` file. Output is larger than JPEG but decode is simpler and there is no quality loss. Worth benchmarking against JPEG at high quality settings for screen content.
+
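To illustrate why QOI suits content with large uniform regions, here is a deliberately reduced single-pass encoder. It is a sketch, not the reference implementation: it emits only three of the format's chunk types (`QOI_OP_RUN`, `QOI_OP_INDEX`, `QOI_OP_RGBA`) and skips the `DIFF`, `LUMA`, and `RGB` chunks, so its output should be a valid QOI stream but larger than what the reference encoder produces. The function name `qoi_encode_subset` is hypothetical.

```python
import struct

QOI_OP_RGBA = 0xFF
QOI_OP_RUN = 0xC0          # top two bits 11, low six bits = run length - 1
QOI_END_MARKER = b"\x00" * 7 + b"\x01"


def qoi_encode_subset(pixels, width, height):
    """Encode row-major (r, g, b, a) tuples using RUN, INDEX and RGBA only."""
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match dimensions")
    # Header: magic, width, height (big-endian u32), channels=4, colorspace=0.
    out = bytearray(b"qoif" + struct.pack(">IIBB", width, height, 4, 0))
    index = [(0, 0, 0, 0)] * 64      # table of recently seen pixels
    prev = (0, 0, 0, 255)            # initial "previous pixel" per the format
    run = 0
    for px in pixels:
        if px == prev:
            run += 1
            if run == 62:            # longest run one chunk can express
                out.append(QOI_OP_RUN | (run - 1))
                run = 0
            continue
        if run:
            out.append(QOI_OP_RUN | (run - 1))
            run = 0
        r, g, b, a = px
        slot = (r * 3 + g * 5 + b * 7 + a * 11) % 64
        if index[slot] == px:
            out.append(slot)         # QOI_OP_INDEX: top two bits 00
        else:
            index[slot] = px
            out += bytes((QOI_OP_RGBA, r, g, b, a))
        prev = px
    if run:
        out.append(QOI_OP_RUN | (run - 1))
    out += QOI_END_MARKER
    return bytes(out)
```

A run of identical pixels costs one byte per 62 pixels, and a colour seen recently costs one byte, which is where the gain on flat UI backgrounds comes from.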
+### ZSTD over Raw Pixels
+
+ZSTD at compression level 1 is extremely fast and can achieve meaningful ratios on screen content (which tends to be repetitive). No pixel format conversion is needed — capture raw, compress raw, decompress raw, display raw. This avoids any colour space or chroma subsampling decisions and is entirely lossless. The downside is that even compressed, the payload is larger than JPEG for photographic content; for UI-heavy screens it can be competitive.
+
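The capture-raw, compress-raw, decompress-raw path can be sketched as below. Note the substitution: Python's standard-library `zlib` at level 1 stands in for ZSTD here so the example has no third-party dependency; it shows the shape of the lossless round trip, not ZSTD's actual speed or ratios. The synthetic frame and all function names are hypothetical.

```python
import zlib


def make_ui_like_frame(width: int, height: int) -> bytes:
    """Synthetic BGRA frame: a coloured title bar over a flat background."""
    background = bytes((0x20, 0x20, 0x20, 0xFF))
    accent = bytes((0xF0, 0x80, 0x10, 0xFF))
    rows = []
    for y in range(height):
        rows.append((accent if y < 20 else background) * width)
    return b"".join(rows)


def encode_frame(raw: bytes) -> bytes:
    # Stand-in for ZSTD level 1: fastest zlib level, no pixel conversion.
    return zlib.compress(raw, 1)


def decode_frame(payload: bytes, expected_size: int) -> bytes:
    raw = zlib.decompress(payload)
    if len(raw) != expected_size:
        raise ValueError("decoded frame has unexpected size")
    return raw


raw = make_ui_like_frame(320, 200)
payload = encode_frame(raw)
restored = decode_frame(payload, len(raw))
```

The receiver needs the frame geometry out of band to validate the decoded size; how that geometry is conveyed is not specified in this document.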
+### VA-API (Hardware H.264 Intra)
+
+Intra-only H.264 via VA-API gives very high compression with GPU offload. This is the most complex option to set up and introduces a GPU dependency, but may be worthwhile for high-resolution grabs over constrained links. Deferred until simpler codecs are validated.
+
+---
+
 ## X11 / Xorg Integration

 An `xorg` module provides two capabilities that complement the V4L2 camera pipeline: screen geometry queries and an X11-based video feed viewer. Both operate as first-class node roles.
@@ -260,6 +299,8 @@ Queries are exposed as control request/response pairs on the standard transport,

 The module can act as a video source by capturing the contents of a screen region using `XShmGetImage` (MIT-SHM extension). The X server writes the image into a shared memory segment, so pixel data does not travel over the X protocol socket when client and server run on the same machine. The captured region is a configurable rectangle, typically one full monitor by its XRandR geometry, but it can be any sub-region.

+Captured frames are raw, uncompressed pixels: 1920×1080 at 32 bpp is about 8.3 MB (8,294,400 bytes) per frame. Each frame must therefore be encoded before it enters the transport. The grab loop calls the `codec` module to compress each frame, then encapsulates the result. The codec is configured per stream; see [Codec Module](#codec-module).
+
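The per-frame figure follows from simple arithmetic, shown here as a short sketch. The 30 fps rate is an assumed example value, not a setting taken from this document, and the function names are hypothetical.

```python
def raw_frame_bytes(width: int, height: int, bits_per_pixel: int = 32) -> int:
    """Size of one uncompressed frame in bytes."""
    return width * height * bits_per_pixel // 8


def raw_stream_bytes_per_second(width: int, height: int, fps: int,
                                bits_per_pixel: int = 32) -> int:
    """Sustained bandwidth needed to ship uncompressed frames."""
    return raw_frame_bytes(width, height, bits_per_pixel) * fps


frame = raw_frame_bytes(1920, 1080)                  # 8_294_400 bytes
rate = raw_stream_bytes_per_second(1920, 1080, 30)   # 248_832_000 bytes/s
```

At the assumed 30 fps, an uncompressed 1080p grab would need roughly 249 MB/s, which is the motivation for encoding before transport.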
 The grab loop produces frames at a configured rate, encapsulates them, and feeds them into the transport like any other video source. Combined with geometry queries, a remote controller can enumerate monitors, select one, and start a screen grab stream without manual coordinate configuration.

 ### Frame Viewer Sink