video-setup/docs/xorg.md
mikael-lovqvists-claude-agent f5764940e6 Docs: display sink commands, GLFW multi-window notes, planning updates
- protocol.md: add START_DISPLAY (0x000A) and STOP_DISPLAY (0x000B) wire
  schemas and field descriptions; add both to command table
- xorg.md: add 'Multiple windows' section covering glfwPollEvents global
  behaviour, per-context glfwMakeContextCurrent requirement, and
  glfwInit/glfwTerminate ref-counting; includes the gotcha that
  short-circuiting the event loop can starve non-polled windows
- planning.md: add cooperative capture release deferred decision;
  add xorg viewer remote controls (zoom, pan, scale, future shader
  post-processing) to deferred decisions; note xorg viewer controls
  not yet exposed remotely in module table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 08:03:30 +00:00

# X11 / Xorg Integration
See [Architecture Overview](../architecture.md).
An `xorg` module provides three capabilities that complement the V4L2 camera pipeline: screen geometry queries, a screen grab source, and an X11-based video feed viewer. Each operates as a first-class node role.
## Screen Geometry Queries (XRandR)
Using the XRandR extension, the module can enumerate connected outputs and retrieve their geometry — resolution, position within the desktop coordinate space, physical size, and refresh rate. This is useful for:
- **Routing decisions**: knowing the resolution of the target display before deciding how to scale or crop an incoming stream
- **Screen grab source**: determining the exact rectangle to capture for a given monitor
- **Multi-monitor layouts**: placing viewer windows correctly in a multi-head setup without guessing offsets
Queries are exposed as control request/response pairs on the standard transport, so a remote node can ask "what monitors does this machine have?" and receive structured geometry data without any X11 code on the asking side.
## Screen Grab Source
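The module can act as a video source by capturing the contents of a screen region using `XShmGetImage` (MIT-SHM extension). The X server writes the pixels into a shared-memory segment instead of sending them over the X socket, which avoids an extra copy but requires the client and server to run on the same machine. The captured region is a configurable rectangle: typically one full monitor, taken from its XRandR geometry, but it can be any sub-region.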
Raw captured pixels are uncompressed — 1920×1080 at 32 bpp is ~8 MB per frame. Before the frame enters the transport it must be encoded. The grab loop calls the `codec` module to compress each frame, then encapsulates the result. The codec is configured per stream; see [Codec Module](./codec.md).
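The frame-size figure can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative, and the 30 fps rate in the note below is an assumed example, not a value from the module.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes in one uncompressed frame. Assumes whole-byte pixels. */
static uint64_t raw_frame_bytes(uint32_t width, uint32_t height,
                                uint32_t bits_per_pixel)
{
    assert(bits_per_pixel % 8u == 0u);
    return (uint64_t)width * height * (bits_per_pixel / 8u);
}

/* Bytes per second for an uncompressed stream at `fps` frames per second. */
static uint64_t raw_stream_bytes_per_sec(uint32_t width, uint32_t height,
                                         uint32_t bits_per_pixel, uint32_t fps)
{
    return raw_frame_bytes(width, height, bits_per_pixel) * fps;
}
```

For 1920×1080 at 32 bpp this gives 8,294,400 bytes per frame. At an assumed 30 fps that is roughly 249 MB/s, which is why encoding happens before the frame reaches the transport.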
The grab loop produces frames at a configured rate, encapsulates them, and feeds them into the transport like any other video source. Combined with geometry queries, a remote controller can enumerate monitors, select one, and start a screen grab stream without manual coordinate configuration.
## Frame Viewer Sink
The module can act as a video sink by creating a window and rendering the latest received frame into it. The window:
- Takes its geometry (size and monitor placement) at stream open time, using XRandR data when targeting a specific output
- Can be made fullscreen on a chosen output
- Displays the most recently received frame, driven by the low-latency output mode of the relay; it never buffers for completeness
- Forwards keyboard and mouse events back upstream as `INPUT_EVENT` protocol messages, enabling remote control use cases
Scale and crop are applied in the renderer. Four display modes are supported (selected per viewer):
| Mode | Behaviour |
|---|---|
| `STRETCH` | Fill window, ignore aspect ratio |
| `FIT` | Largest rect that fits, preserve aspect, black bars |
| `FILL` | Scale to cover, preserve aspect, crop edges |
| `1:1` | Native pixel size, no scaling; excess cropped |
Each mode combines with an anchor (`CENTER` or `TOP_LEFT`) that controls placement when the frame does not fill the window exactly.
This allows a high-resolution source (Pi camera, screen grab) to be displayed scaled-down on a different machine, or viewed at native resolution with panning.
This makes it the display-side counterpart of the V4L2 capture source: a frame grabbed from a camera on a Pi can be viewed on any machine in the network running a viewer sink node, with the relay handling the path and delivery mode.
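The four display modes and two anchors reduce to choosing a scale factor per axis and an offset. The following is a sketch under assumptions: the enum and function names are hypothetical, and the real renderer applies the equivalent logic in its shader pass rather than on the CPU.

```c
#include <assert.h>

typedef enum { MODE_STRETCH, MODE_FIT, MODE_FILL, MODE_1TO1 } Display_Mode;
typedef enum { ANCHOR_CENTER, ANCHOR_TOP_LEFT } Anchor;
typedef struct { double x, y, w, h; } Dest_Rect;

/* Returns the rectangle, in window coordinates, that the frame is drawn
 * into. Parts of the rectangle outside the window are cropped; parts of
 * the window outside the rectangle are the black bars. */
static Dest_Rect place_frame(Display_Mode mode, Anchor anchor,
                             double frame_w, double frame_h,
                             double win_w, double win_h)
{
    assert(frame_w > 0 && frame_h > 0 && win_w > 0 && win_h > 0);
    double sx = win_w / frame_w;
    double sy = win_h / frame_h;

    switch (mode) {
    case MODE_STRETCH: break;                                /* independent axes */
    case MODE_FIT:     sx = sy = (sx < sy) ? sx : sy; break; /* smaller scale */
    case MODE_FILL:    sx = sy = (sx > sy) ? sx : sy; break; /* larger scale */
    case MODE_1TO1:    sx = sy = 1.0; break;                 /* native pixels */
    }

    Dest_Rect r = { 0.0, 0.0, frame_w * sx, frame_h * sy };
    if (anchor == ANCHOR_CENTER) {
        r.x = (win_w - r.w) / 2.0;
        r.y = (win_h - r.h) / 2.0;
    }
    return r;
}
```

For a 1600×900 frame in an 800×800 window, `FIT` with `CENTER` yields an 800×450 rectangle at y = 175, leaving bars above and below.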
### Renderer: GLFW + OpenGL
The initial implementation uses **GLFW** for window and input management and **OpenGL** for rendering.
GLFW handles window creation, the event loop, resize, and input callbacks — it also supports Vulkan surface creation using the same API, which makes a future renderer swap straightforward. Input events (keyboard, mouse) are normalised by GLFW before being encoded as protocol messages.
The OpenGL renderer:
1. For **MJPEG**: calls `tjDecompressToYUVPlanes` (libjpeg-turbo) to decompress directly to planar YUV, with no CPU-side color conversion. JPEG stores YCbCr internally, so this is the minimal decode path: the output of Huffman decoding and the inverse DCT lands directly in the YUV planes.
2. Uploads Y, Cb, Cr as separate `GL_RED` textures. Chroma planes are subsampled as delivered by most V4L2 cameras: half resolution in both dimensions for 4:2:0, half resolution horizontally only for 4:2:2.
3. Fragment shader samples the three planes and applies the BT.601 matrix to produce RGB — a few lines of GLSL.
4. Scaling and filtering happen in the same shader pass.
5. Presents via GLFW's swap-buffers call.
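The chroma texture sizes in step 2 follow from the subsampling mode. A minimal sketch of the ratios, with hypothetical type and function names; when sizing real decode buffers, TurboJPEG's own `tjPlaneWidth` and `tjPlaneHeight` are probably the better choice, since they also account for padding.

```c
#include <assert.h>

typedef enum { SUBSAMP_444, SUBSAMP_422, SUBSAMP_420 } Chroma_Subsampling;
typedef struct { int w, h; } Plane_Size;

/* Size of each chroma plane (Cb or Cr) for a given luma size.
 * Odd luma dimensions round up so edge pixels still have chroma. */
static Plane_Size chroma_plane_size(int luma_w, int luma_h,
                                    Chroma_Subsampling ss)
{
    assert(luma_w > 0 && luma_h > 0);
    Plane_Size p = { luma_w, luma_h };
    if (ss == SUBSAMP_422 || ss == SUBSAMP_420)
        p.w = (luma_w + 1) / 2;     /* halved horizontally */
    if (ss == SUBSAMP_420)
        p.h = (luma_h + 1) / 2;     /* also halved vertically */
    return p;
}
```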
For **raw pixel formats** (BGRA, YUV planar from the wire): uploaded directly without decode; shader handles any necessary swizzle or conversion.
This keeps CPU load minimal: the only CPU work for MJPEG is Huffman decoding and the inverse DCT, which libjpeg-turbo accelerates with SIMD. All color conversion and scaling happen on the GPU.
### Text overlays
Two tiers, implemented in order:
**Tier 1 — bitmap font atlas (done)**
`tools/gen_font_atlas/gen_font_atlas.py` (Python/Pillow) renders glyphs 32–255 from DejaVu Sans at 16pt into a packed grayscale atlas using a skyline bin packer and emits `build/gen/font_atlas.h`, a C header with the pixel data as a `static const uint8_t` array and a `Font_Glyph[256]` metrics table indexed by codepoint.
At runtime the atlas is uploaded as a `GL_R8` texture. Each overlay is rendered as a batch of alpha-blended glyph quads preceded by a semi-transparent dark background rect (using a separate minimal screen-space rect shader driven by `gl_VertexID`). The public API is `xorg_viewer_set_overlay_text(v, idx, x, y, text, r, g, b)` and `xorg_viewer_clear_overlays(v)`. Up to 8 independent overlays are supported.
The generator runs automatically as a `make` dependency before compiling `xorg.c`. The Pillow build tool is the only Python dependency; there are no runtime font deps.
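To illustrate how a codepoint-indexed metrics table is typically used, the sketch below measures a string's pixel width, for example to size the overlay's background rect. The `Font_Glyph` field layout shown is an assumption made for the example; the generated `font_atlas.h` may define different fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; the real Font_Glyph in font_atlas.h may differ. */
typedef struct {
    uint16_t atlas_x, atlas_y;    /* top-left of the glyph in the atlas */
    uint8_t  w, h;                /* glyph bitmap size in pixels */
    int8_t   bearing_x, bearing_y;
    uint8_t  advance;             /* horizontal pen advance in pixels */
} Font_Glyph;

/* Sums the advances of all bytes in `text`, treating each byte as a
 * codepoint. Bytes below 32 have no glyph in the atlas and are skipped. */
static int measure_text(const Font_Glyph glyphs[256], const char *text)
{
    assert(glyphs != NULL && text != NULL);
    int width = 0;
    for (const unsigned char *p = (const unsigned char *)text; *p; ++p) {
        if (*p < 32)
            continue;
        width += glyphs[*p].advance;
    }
    return width;
}
```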
**Tier 2 — HarfBuzz + FreeType (future)**
A proper runtime font stack for full typography: correct shaping, kerning, ligatures, bidirectional text, non-Latin scripts. Added as a feature flag with its own runtime deps alongside the blit path.
When Tier 2 is implemented, the Pillow build dependency may be replaced by a purpose-built atlas generator (removing the Python dep entirely), if the blit path is still useful alongside the full shaping path.
### Render loop
The viewer is driven by incoming frames rather than a fixed-rate loop. Two polling functions are provided depending on the use case:
**Static image / test tool**: `xorg_viewer_poll(v)` processes events, then re-renders from the existing textures:
```c
while (xorg_viewer_poll(v)) { /* wait for close */ }
```
**Live stream** — the push functions (`push_yuv420`, `push_mjpeg`, etc.) already upload and render. Use `xorg_viewer_handle_events(v)` to process window events without an extra render:
```c
while (1) {
/* block on V4L2/network fd until frame or timeout */
if (frame_available) {
xorg_viewer_push_mjpeg(v, data, size); /* upload + render */
}
if (!xorg_viewer_handle_events(v)) { break; }
}
```
A `framebuffer_size_callback` registered on the window calls `render()` synchronously during resize, so the image tracks the window edge without a one-frame lag.
Threading note: the GL context must be used from the thread that created it. In the video node, incoming frames arrive on a network receive thread. A frame queue between the receive thread and the render thread (which owns the GL context) is the correct model — the render thread drains the queue each poll iteration rather than having the network thread call push functions directly.
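One way to realise the hand-off between the receive thread and the render thread is sketched below. It is a one-slot variant of the queue (a "latest frame" mailbox) rather than a full FIFO, chosen here because the viewer only displays the most recent frame. All names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    pthread_mutex_t lock;
    unsigned char  *data;   /* owned by the mailbox until taken */
    size_t          size;
} Frame_Mailbox;

static void mailbox_init(Frame_Mailbox *m)
{
    assert(m != NULL);
    pthread_mutex_init(&m->lock, NULL);
    m->data = NULL;
    m->size = 0;
}

/* Receive thread: copies the frame in, replacing any frame that has not
 * been rendered yet. Returns false if the copy could not be allocated. */
static bool mailbox_put(Frame_Mailbox *m, const void *data, size_t size)
{
    assert(m != NULL && data != NULL && size > 0);
    unsigned char *copy = malloc(size);
    if (copy == NULL)
        return false;
    memcpy(copy, data, size);

    pthread_mutex_lock(&m->lock);
    unsigned char *stale = m->data;
    m->data = copy;
    m->size = size;
    pthread_mutex_unlock(&m->lock);

    free(stale);            /* drop the frame that was never displayed */
    return true;
}

/* Render thread: takes ownership of the newest frame, or returns NULL if
 * nothing arrived since the last call. The caller frees the result. */
static unsigned char *mailbox_take(Frame_Mailbox *m, size_t *size_out)
{
    assert(m != NULL && size_out != NULL);
    pthread_mutex_lock(&m->lock);
    unsigned char *data = m->data;
    *size_out = m->size;
    m->data = NULL;
    m->size = 0;
    pthread_mutex_unlock(&m->lock);
    return data;
}
```

In the render loop, the thread that owns the GL context would call `mailbox_take` each iteration and, when it returns a frame, pass it to `xorg_viewer_push_mjpeg(v, data, size)` before freeing it.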
### Multiple windows
GLFW supports multiple windows from the same thread. `glfwCreateWindow` can be called repeatedly; each call returns an independent window handle with its own GL context. The video node uses this to display several streams simultaneously (one window per active `Display_Slot`).
**`glfwPollEvents` is global.** It drains the event queue for all windows at once, not just the window belonging to the viewer it is called through. When iterating over multiple display slots and calling `xorg_viewer_handle_events` on each, only the first call does real work; subsequent calls usually find the queue already empty. This is harmless but worth knowing: if the loop is ever restructured so that event polling is conditional or short-circuited, make sure `glfwPollEvents` still runs at least once per iteration, or every window will stop responding to input.
**Each window has its own GL context.** `glfwMakeContextCurrent` must be called before any GL operations to ensure calls go to the right context. The push functions (`push_yuv420`, `push_bgra`, `push_mjpeg`) and `poll` do this automatically. Code that calls GL functions directly must make the correct context current first.
**`glfwInit`/`glfwTerminate` are ref-counted** in the xorg module. The first `xorg_viewer_open` call initialises GLFW; `glfwTerminate` is deferred until the last viewer is closed. Do not call `glfwTerminate` directly — use `xorg_viewer_close` and let the ref count manage it.
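The ref-counting pattern can be sketched as follows. To keep the example compilable without GLFW, `backend_init` and `backend_terminate` are stand-ins for `glfwInit` and `glfwTerminate`; `viewer_acquire` and `viewer_release` are hypothetical names for whatever `xorg_viewer_open` and `xorg_viewer_close` do internally.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for glfwInit()/glfwTerminate(); they only count calls. */
static int backend_init_calls;
static int backend_terminate_calls;
static bool backend_init(void)      { backend_init_calls++; return true; }
static void backend_terminate(void) { backend_terminate_calls++; }

/* Number of open viewers. Not thread-safe: assumes all viewers are
 * opened and closed from the thread that owns the windows. */
static int viewer_refs;

/* Called when a viewer opens: initialises the backend on first use. */
static bool viewer_acquire(void)
{
    if (viewer_refs == 0 && !backend_init())
        return false;
    viewer_refs++;
    return true;
}

/* Called when a viewer closes: terminates the backend after the last one. */
static void viewer_release(void)
{
    assert(viewer_refs > 0);
    if (--viewer_refs == 0)
        backend_terminate();
}
```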
### Renderer: Vulkan (future alternative)
A Vulkan renderer is planned as an alternative to the OpenGL one. GLFW's surface creation API is renderer-agnostic, so the window management and input handling code is shared. Only the renderer backend changes.
Vulkan offers more explicit control over presentation timing, multi-queue workloads, and compute shaders (e.g. on-GPU MJPEG decode via a compute pass if a suitable library is available). It is not needed for the initial viewer but worth having for high-frame-rate or multi-stream display scenarios.
The renderer selection should be a compile-time or runtime option — both implementations conform to the same internal interface (`render_frame(pixel_buffer, width, height, format)`).
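A runtime-selectable backend could be expressed as a small table of function pointers sharing the `render_frame` signature. This is a sketch only: the `Pixel_Format` values, the `Renderer` struct, the return-code convention, and the stub bodies are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { PIXEL_BGRA, PIXEL_YUV420 } Pixel_Format;  /* hypothetical */

typedef struct {
    const char *name;
    /* Returns 0 on success, non-zero on failure (assumed convention). */
    int (*render_frame)(const void *pixel_buffer, int width, int height,
                        Pixel_Format format);
} Renderer;

/* Stub backends; real ones would upload textures and draw. */
static int gl_render_frame(const void *pb, int w, int h, Pixel_Format f)
{
    (void)pb; (void)w; (void)h; (void)f;
    return 0;
}

static int vk_render_frame(const void *pb, int w, int h, Pixel_Format f)
{
    (void)pb; (void)w; (void)h; (void)f;
    return -1;              /* placeholder: not implemented in this sketch */
}

static const Renderer renderers[] = {
    { "opengl", gl_render_frame },
    { "vulkan", vk_render_frame },
};

/* Looks up a backend by name; returns NULL if the name is unknown. */
static const Renderer *renderer_select(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof renderers / sizeof renderers[0]; ++i) {
        if (strcmp(renderers[i].name, name) == 0)
            return &renderers[i];
    }
    return NULL;
}
```

A compile-time selection would replace the lookup with a single `#if`-guarded definition, at the cost of needing a rebuild to switch backends.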