# X11 / Xorg Integration
An xorg module provides two capabilities that complement the V4L2 camera pipeline: screen geometry queries and an X11-based video feed viewer. Both operate as first-class node roles.
## Screen Geometry Queries (XRandR)
Using the XRandR extension, the module can enumerate connected outputs and retrieve their geometry — resolution, position within the desktop coordinate space, physical size, and refresh rate. This is useful for:
- Routing decisions: knowing the resolution of the target display before deciding how to scale or crop an incoming stream
- Screen grab source: determining the exact rectangle to capture for a given monitor
- Multi-monitor layouts: placing viewer windows correctly in a multi-head setup without guessing offsets
Queries are exposed as control request/response pairs on the standard transport, so a remote node can ask "what monitors does this machine have?" and receive structured geometry data without any X11 code on the asking side.
## Screen Grab Source
The module can act as a video source by capturing the contents of a screen region using `XShmGetImage` (MIT-SHM extension) for zero-copy capture within the same machine. The captured region is a configurable rectangle — typically one full monitor as reported by its XRandR geometry, but it can be any sub-region.
Raw captured pixels are uncompressed — 1920×1080 at 32 bpp is ~8 MB per frame. Before the frame enters the transport it must be encoded. The grab loop calls the codec module to compress each frame, then encapsulates the result. The codec is configured per stream; see Codec Module.
The grab loop produces frames at a configured rate, encapsulates them, and feeds them into the transport like any other video source. Combined with geometry queries, a remote controller can enumerate monitors, select one, and start a screen grab stream without manual coordinate configuration.
## Frame Viewer Sink
The module can act as a video sink by creating a window and rendering the latest received frame into it. The window:
- Geometry (size and monitor placement) is specified at stream open time, using XRandR data when targeting a specific output
- Can be made fullscreen on a chosen output
- Displays the most recently received frame — driven by the low-latency output mode of the relay; never buffers for completeness
- Forwards keyboard and mouse events back upstream as `INPUT_EVENT` protocol messages, enabling remote control use cases
Scale and crop are applied in the renderer. Four display modes are supported (selected per viewer):
| Mode | Behaviour |
|---|---|
| `STRETCH` | Fill window, ignore aspect ratio |
| `FIT` | Largest rect that fits, preserve aspect, black bars |
| `FILL` | Scale to cover, preserve aspect, crop edges |
| `1:1` | Native pixel size, no scaling; excess cropped |
Each mode combines with an anchor (CENTER or TOP_LEFT) that controls placement when the frame does not fill the window exactly.
This allows a high-resolution source (Pi camera, screen grab) to be displayed scaled-down on a different machine, or viewed at native resolution with panning.
This makes it the display-side counterpart of the V4L2 capture source: a frame grabbed from a camera on a Pi can be viewed on any machine in the network running a viewer sink node, with the relay handling the path and delivery mode.
## Renderer: GLFW + OpenGL
The initial implementation uses GLFW for window and input management and OpenGL for rendering.
GLFW handles window creation, the event loop, resize, and input callbacks — it also supports Vulkan surface creation using the same API, which makes a future renderer swap straightforward. Input events (keyboard, mouse) are normalised by GLFW before being encoded as protocol messages.
The OpenGL renderer:
- For MJPEG: calls `tjDecompressToYUVPlanes` (libjpeg-turbo) to decompress directly to planar YUV — no CPU-side color conversion. JPEG stores YCbCr internally, so this is the minimal decode path: Huffman + DCT output lands directly in the YUV planes.
- Uploads Y, Cb, Cr as separate `GL_RED` textures (chroma at half resolution for 4:2:0 / 4:2:2, as delivered by most V4L2 cameras).
- Fragment shader samples the three planes and applies the BT.601 matrix to produce RGB — a few lines of GLSL.
- Scaling and filtering happen in the same shader pass.
- Presents via GLFW's swap-buffers call.
For raw pixel formats (BGRA, planar YUV from the wire), frames are uploaded directly without decode; the shader handles any necessary swizzle or conversion.
This keeps CPU load minimal — the only CPU work for MJPEG is Huffman decode and DCT, which libjpeg-turbo runs with SIMD. All color conversion and scaling is on the GPU.
## Text overlays
Two tiers, implemented in order:
### Tier 1 — bitmap font atlas (done)
`tools/gen_font_atlas/gen_font_atlas.py` (Python/Pillow) renders glyphs 32–255 from DejaVu Sans at 16 pt into a packed grayscale atlas using a skyline bin packer and emits `build/gen/font_atlas.h` — a C header with the pixel data as a `static const uint8_t` array and a `Font_Glyph[256]` metrics table indexed by codepoint.
At runtime the atlas is uploaded as a `GL_R8` texture. Each overlay is rendered as a batch of alpha-blended glyph quads preceded by a semi-transparent dark background rect (using a separate minimal screen-space rect shader driven by `gl_VertexID`). The public API is `xorg_viewer_set_overlay_text(v, idx, x, y, text, r, g, b)` and `xorg_viewer_clear_overlays(v)`. Up to 8 independent overlays are supported.
The generator runs automatically as a make dependency before compiling `xorg.c`. Pillow is the only build-time Python dependency; there are no runtime font deps.
### Tier 2 — HarfBuzz + FreeType (future)
A proper runtime font stack for full typography: correct shaping, kerning, ligatures, bidirectional text, non-Latin scripts. Added as a feature flag with its own runtime deps alongside the blit path.
When Tier 2 is implemented, the Pillow build dependency may be replaced by a purpose-built atlas generator (removing the Python dep entirely), if the blit path is still useful alongside the full shaping path.
## Render loop
The viewer is driven by incoming frames rather than a fixed-rate loop. Two polling functions are provided depending on the use case:
Static image / test tool — `xorg_viewer_poll(v)` processes events then re-renders from existing textures:

```c
while (xorg_viewer_poll(v)) { /* wait for close */ }
```
Live stream — the push functions (`push_yuv420`, `push_mjpeg`, etc.) already upload and render. Use `xorg_viewer_handle_events(v)` to process window events without an extra render:

```c
while (1) {
    /* block on V4L2/network fd until frame or timeout */
    if (frame_available) {
        xorg_viewer_push_mjpeg(v, data, size); /* upload + render */
    }
    if (!xorg_viewer_handle_events(v)) { break; }
}
```
A `framebuffer_size_callback` registered on the window calls `render()` synchronously during resize, so the image tracks the window edge without a one-frame lag.
Threading note: the GL context must be used from the thread that created it. In the video node, incoming frames arrive on a network receive thread. A frame queue between the receive thread and the render thread (which owns the GL context) is the correct model — the render thread drains the queue each poll iteration rather than having the network thread call push functions directly.
## Multiple windows
GLFW supports multiple windows from the same thread. `glfwCreateWindow` can be called repeatedly; each call returns an independent window handle with its own GL context. The video node uses this to display several streams simultaneously (one window per active `Display_Slot`).
`glfwPollEvents` is global. It drains the event queue for all windows at once, not just the one associated with the viewer it is called through. When iterating over multiple display slots and calling `xorg_viewer_handle_events` on each, only the first call does real work; subsequent calls are no-ops because the queue is already empty. This is harmless but worth knowing: if the loop is ever restructured so that event polling is conditional or short-circuited, all windows need at least one `glfwPollEvents` call per iteration or they will stop responding to input.
Each window has its own GL context. `glfwMakeContextCurrent` must be called before any GL operations to ensure calls go to the right context. The push functions (`push_yuv420`, `push_bgra`, `push_mjpeg`) and poll do this automatically. Code that calls GL functions directly must make the correct context current first.
`glfwInit`/`glfwTerminate` are ref-counted in the xorg module. The first `xorg_viewer_open` call initialises GLFW; `glfwTerminate` is deferred until the last viewer is closed. Do not call `glfwTerminate` directly — use `xorg_viewer_close` and let the ref count manage it.
## Renderer: Vulkan (future alternative)
A Vulkan renderer is planned as an alternative to the OpenGL one. GLFW's surface creation API is renderer-agnostic, so the window management and input handling code is shared. Only the renderer backend changes.
Vulkan offers more explicit control over presentation timing, multi-queue workloads, and compute shaders (e.g. on-GPU MJPEG decode via a compute pass if a suitable library is available). It is not needed for the initial viewer but worth having for high-frame-rate or multi-stream display scenarios.
The renderer selection should be a compile-time or runtime option — both implementations conform to the same internal interface (`render_frame(pixel_buffer, width, height, format)`).