# X11 / Xorg Integration
See [Architecture Overview](../architecture.md).
An `xorg` module provides two capabilities that complement the V4L2 camera pipeline: screen geometry queries and an X11-based video feed viewer. Both operate as first-class node roles.
## Screen Geometry Queries (XRandR)
Using the XRandR extension, the module can enumerate connected outputs and retrieve their geometry — resolution, position within the desktop coordinate space, physical size, and refresh rate. This is useful for:
- **Routing decisions**: knowing the resolution of the target display before deciding how to scale or crop an incoming stream
- **Screen grab source**: determining the exact rectangle to capture for a given monitor
- **Multi-monitor layouts**: placing viewer windows correctly in a multi-head setup without guessing offsets
Queries are exposed as control request/response pairs on the standard transport, so a remote node can ask "what monitors does this machine have?" and receive structured geometry data without any X11 code on the asking side.
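As an illustration, the structured geometry data might look like this on the consuming side. The struct fields and the `output_at` helper are hypothetical (the actual wire schema lives in protocol.md); the helper shows one typical use — mapping a desktop point to the output that contains it:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical shape of one entry in a geometry query response;
 * field names are illustrative, not the actual wire schema. */
typedef struct {
    int32_t  x, y;                /* position in desktop coordinate space */
    uint32_t width, height;       /* resolution in pixels */
    uint32_t mm_width, mm_height; /* physical size */
    uint32_t refresh_mhz;         /* refresh rate in millihertz */
} Output_Geometry;

/* Pick the output whose rectangle contains a desktop point, e.g. to
 * decide which monitor a viewer window should target. */
static const Output_Geometry *output_at(const Output_Geometry *outs,
                                        size_t n, int32_t px, int32_t py)
{
    for (size_t i = 0; i < n; i++) {
        const Output_Geometry *o = &outs[i];
        if (px >= o->x && px < o->x + (int32_t)o->width &&
            py >= o->y && py < o->y + (int32_t)o->height)
            return o;
    }
    return NULL; /* point is outside every output */
}
```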
## Screen Grab Source
The module can act as a video source by capturing the contents of a screen region using `XShmGetImage` (MIT-SHM extension) for zero-copy capture within the same machine. The captured region is a configurable rectangle — typically one full monitor by its XRandR geometry, but can be any sub-region.
Raw captured pixels are uncompressed — 1920×1080 at 32 bpp is ~8 MB per frame. Before the frame enters the transport it must be encoded. The grab loop calls the `codec` module to compress each frame, then encapsulates the result. The codec is configured per stream; see [Codec Module](./codec.md).
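The arithmetic behind the ~8 MB figure, as a tiny helper (illustrative, not part of the module API):

```c
#include <stdint.h>

/* Bytes per raw frame at the given bits-per-pixel.
 * 1920x1080 at 32 bpp -> 8,294,400 bytes (~8 MB); at 30 fps that is
 * roughly 249 MB/s, which is why frames are encoded before transport. */
static uint64_t raw_frame_bytes(uint32_t w, uint32_t h, uint32_t bpp)
{
    return (uint64_t)w * h * (bpp / 8);
}
```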
The grab loop produces frames at a configured rate, encapsulates them, and feeds them into the transport like any other video source. Combined with geometry queries, a remote controller can enumerate monitors, select one, and start a screen grab stream without manual coordinate configuration.
## Frame Viewer Sink
The module can act as a video sink by creating a window and rendering the latest received frame into it. The window:
- Has its geometry (size and monitor placement) specified at stream open time, using XRandR data when targeting a specific output
- Can be made fullscreen on a chosen output
- Displays the most recently received frame — driven by the low-latency output mode of the relay; never buffers for completeness
- Forwards keyboard and mouse events back upstream as `INPUT_EVENT` protocol messages, enabling remote control use cases
Scale and crop are applied in the renderer. Four display modes are supported (selected per viewer):
| Mode | Behaviour |
|---|---|
| `STRETCH` | Fill window, ignore aspect ratio |
| `FIT` | Largest rect that fits, preserve aspect, black bars |
| `FILL` | Scale to cover, preserve aspect, crop edges |
| `1:1` | Native pixel size, no scaling; excess cropped |
Each mode combines with an anchor (`CENTER` or `TOP_LEFT`) that controls placement when the frame does not fill the window exactly.
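As an illustration, the `FIT` + `CENTER` combination reduces to the following math. This is a CPU-side sketch of what the shader pass computes; the struct and function names are hypothetical:

```c
#include <stdint.h>

typedef struct {
    float scale;        /* uniform scale factor applied to the frame */
    float off_x, off_y; /* placement offset inside the window, pixels */
} Placement;

/* FIT: largest uniform scale at which the frame fits the window,
 * preserving aspect ratio; CENTER splits the leftover (black bars)
 * evenly on both sides of the limiting axis. */
static Placement place_fit_center(uint32_t fw, uint32_t fh,   /* frame */
                                  uint32_t ww, uint32_t wh)   /* window */
{
    float sx = (float)ww / fw, sy = (float)wh / fh;
    float s  = sx < sy ? sx : sy;      /* limiting axis wins */
    Placement p;
    p.scale = s;
    p.off_x = (ww - fw * s) / 2.0f;
    p.off_y = (wh - fh * s) / 2.0f;
    return p;
}
```

`FILL` is the same computation with the larger of `sx`/`sy`, giving negative offsets (the cropped edges); `TOP_LEFT` simply pins both offsets at zero or above.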
This allows a high-resolution source (Pi camera, screen grab) to be displayed scaled-down on a different machine, or viewed at native resolution with panning.
This makes it the display-side counterpart of the V4L2 capture source: a frame grabbed from a camera on a Pi can be viewed on any machine in the network running a viewer sink node, with the relay handling the path and delivery mode.
### Renderer: GLFW + OpenGL
The initial implementation uses **GLFW** for window and input management and **OpenGL** for rendering.
GLFW handles window creation, the event loop, resize, and input callbacks — it also supports Vulkan surface creation using the same API, which makes a future renderer swap straightforward. Input events (keyboard, mouse) are normalised by GLFW before being encoded as protocol messages.
The OpenGL renderer:
1. For **MJPEG**: calls `tjDecompressToYUVPlanes` (libjpeg-turbo) to decompress directly to planar YUV — no CPU-side color conversion. JPEG stores YCbCr internally so this is the minimal decode path: Huffman + DCT output lands directly in YUV planes.
2. Uploads Y, Cb, Cr as separate `GL_RED` textures (chroma subsampled: half resolution in both dimensions for 4:2:0, half horizontally for 4:2:2, as delivered by most V4L2 cameras).
3. Fragment shader samples the three planes and applies the BT.601 matrix to produce RGB — a few lines of GLSL.
4. Scaling and filtering happen in the same shader pass.
5. Presents via GLFW's swap-buffers call.
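For reference, the BT.601 conversion the shader applies can be sketched on the CPU like this. This is the full-range variant produced by JPEG decode; a video-range (16–235) source would need the offset-and-scale variant instead:

```c
#include <stdint.h>

/* BT.601 full-range YCbCr -> RGB: the same matrix the fragment shader
 * applies per pixel, written out as a CPU reference for illustration. */
static void yuv_to_rgb_bt601(uint8_t y, uint8_t cb, uint8_t cr,
                             uint8_t *r, uint8_t *g, uint8_t *b)
{
    float yf  = (float)y;
    float cbf = (float)cb - 128.0f;   /* chroma is centred on 128 */
    float crf = (float)cr - 128.0f;

    float rf = yf + 1.402f    * crf;
    float gf = yf - 0.344136f * cbf - 0.714136f * crf;
    float bf = yf + 1.772f    * cbf;

    /* clamp to [0, 255] and round */
    *r = rf < 0 ? 0 : rf > 255 ? 255 : (uint8_t)(rf + 0.5f);
    *g = gf < 0 ? 0 : gf > 255 ? 255 : (uint8_t)(gf + 0.5f);
    *b = bf < 0 ? 0 : bf > 255 ? 255 : (uint8_t)(bf + 0.5f);
}
```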
**Raw pixel formats** (BGRA, YUV planar from the wire) are uploaded directly without decode; the shader handles any necessary swizzle or conversion.
This keeps CPU load minimal — the only CPU work for MJPEG is Huffman decode and the inverse DCT, which libjpeg-turbo runs with SIMD. All color conversion and scaling happen on the GPU.
### Text overlays
Two tiers, implemented in order:
**Tier 1 — bitmap font atlas (done)**
`tools/gen_font_atlas/gen_font_atlas.py` (Python/Pillow) renders glyphs 32–255 from DejaVu Sans at 16pt into a packed grayscale atlas using a skyline bin packer and emits `build/gen/font_atlas.h` — a C header with the pixel data as a `static const uint8_t` array and a `Font_Glyph[256]` metrics table indexed by codepoint.
At runtime the atlas is uploaded as a `GL_R8` texture. Each overlay is rendered as a batch of alpha-blended glyph quads preceded by a semi-transparent dark background rect (using a separate minimal screen-space rect shader driven by `gl_VertexID`). The public API is `xorg_viewer_set_overlay_text(v, idx, x, y, text, r, g, b)` and `xorg_viewer_clear_overlays(v)`. Up to 8 independent overlays are supported.
The generator runs automatically as a `make` dependency before compiling `xorg.c`. Pillow is the only Python dependency, and it is build-time only; there are no runtime font deps.
**Tier 2 — HarfBuzz + FreeType (future)**
A proper runtime font stack for full typography: correct shaping, kerning, ligatures, bidirectional text, non-Latin scripts. It would be added behind a feature flag, with its own runtime dependencies, alongside the blit path.
When Tier 2 is implemented, the Pillow build dependency may be replaced by a purpose-built atlas generator (removing the Python dep entirely), if the blit path is still useful alongside the full shaping path.
### Render loop
The viewer is driven by incoming frames rather than a fixed-rate loop. Two polling functions are provided depending on the use case:
**Static image / test tool** — `xorg_viewer_poll(v)` processes events then re-renders from existing textures:
```c
while (xorg_viewer_poll(v)) { /* wait for close */ }
```
**Live stream** — the push functions (`push_yuv420`, `push_mjpeg`, etc.) already upload and render. Use `xorg_viewer_handle_events(v)` to process window events without an extra render:
```c
while (1) {
    /* block on V4L2/network fd until frame or timeout */
    if (frame_available) {
        xorg_viewer_push_mjpeg(v, data, size); /* upload + render */
    }
    if (!xorg_viewer_handle_events(v)) { break; }
}
```
A `framebuffer_size_callback` registered on the window calls `render()` synchronously during resize, so the image tracks the window edge without a one-frame lag.
Threading note: the GL context must be used from the thread that created it. In the video node, incoming frames arrive on a network receive thread. A frame queue between the receive thread and the render thread (which owns the GL context) is the correct model — the render thread drains the queue each poll iteration rather than having the network thread call push functions directly.
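A minimal sketch of such a queue, reduced to a single-slot latest-frame mailbox — which matches the viewer's semantics, since only the most recent frame matters and a newer push may overwrite an unconsumed one. Type and function names are illustrative:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Single-slot mailbox between the network receive thread (producer)
 * and the render thread that owns the GL context (consumer). */
typedef struct {
    pthread_mutex_t lock;
    uint8_t *data;   /* owned copy of the latest frame, or NULL */
    size_t   size;
    bool     fresh;  /* true if data has not been consumed yet */
} Frame_Mailbox;

/* Receive thread: store a copy, dropping any stale unconsumed frame. */
static void mailbox_push(Frame_Mailbox *m, const uint8_t *data, size_t size)
{
    pthread_mutex_lock(&m->lock);
    free(m->data);               /* drop the stale frame, if any */
    m->data = malloc(size);
    memcpy(m->data, data, size);
    m->size = size;
    m->fresh = true;
    pthread_mutex_unlock(&m->lock);
}

/* Render thread: take ownership of the latest frame, or NULL if none.
 * The caller pushes it to the viewer and then frees it. */
static uint8_t *mailbox_take(Frame_Mailbox *m, size_t *size)
{
    pthread_mutex_lock(&m->lock);
    uint8_t *out = m->fresh ? m->data : NULL;
    if (m->fresh) { *size = m->size; m->data = NULL; m->fresh = false; }
    pthread_mutex_unlock(&m->lock);
    return out;
}
```

The render thread drains the mailbox each loop iteration and calls the push functions itself, keeping all GL work on the context-owning thread.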
### Multiple windows
GLFW supports multiple windows from the same thread. `glfwCreateWindow` can be called repeatedly; each call returns an independent window handle with its own GL context. The video node uses this to display several streams simultaneously (one window per active `Display_Slot`).
**`glfwPollEvents` is global.** It drains the event queue for all windows at once, not just the one associated with the viewer it is called through. When iterating over multiple display slots and calling `xorg_viewer_handle_events` on each, only the first call does real work; subsequent calls are no-ops because the queue is already empty. This is harmless but worth knowing: if the loop is ever restructured so that event polling is conditional or short-circuited, all windows need at least one `glfwPollEvents` call per iteration or they will stop responding to input.
**Each window has its own GL context.** `glfwMakeContextCurrent` must be called before any GL operations to ensure calls go to the right context. The push functions (`push_yuv420`, `push_bgra`, `push_mjpeg`) and `poll` do this automatically. Code that calls GL functions directly must make the correct context current first.
**`glfwInit`/`glfwTerminate` are ref-counted** in the xorg module. The first `xorg_viewer_open` call initialises GLFW; `glfwTerminate` is deferred until the last viewer is closed. Do not call `glfwTerminate` directly — use `xorg_viewer_close` and let the ref count manage it.
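The ref-counting logic can be sketched as follows, with stand-in `backend_init`/`backend_terminate` functions in place of `glfwInit`/`glfwTerminate` so the sketch runs without a display:

```c
#include <stdbool.h>

static int  viewer_refs = 0;
static bool backend_up  = false;

/* Stand-ins for glfwInit / glfwTerminate. */
static void backend_init(void)      { backend_up = true;  }
static void backend_terminate(void) { backend_up = false; }

static void viewer_open(void)
{
    if (viewer_refs++ == 0)
        backend_init();         /* first viewer brings the backend up */
}

static void viewer_close(void)
{
    if (--viewer_refs == 0)
        backend_terminate();    /* last viewer tears it down */
}
```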
### Renderer: Vulkan (future alternative)
A Vulkan renderer is planned as an alternative to the OpenGL one. GLFW's surface creation API is renderer-agnostic, so the window management and input handling code is shared. Only the renderer backend changes.
Vulkan offers more explicit control over presentation timing, multi-queue workloads, and compute shaders (e.g. on-GPU MJPEG decode via a compute pass if a suitable library is available). It is not needed for the initial viewer but worth having for high-frame-rate or multi-stream display scenarios.
The renderer selection should be a compile-time or runtime option — both implementations conform to the same internal interface (`render_frame(pixel_buffer, width, height, format)`).
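One possible shape for that interface, sketched with a stub backend in place of the real GL code; the names are hypothetical:

```c
#include <stdint.h>

typedef enum { PIX_BGRA, PIX_YUV420 } Pixel_Format;

/* Both the OpenGL backend and a future Vulkan backend would fill in
 * the same function pointers; the viewer calls through the struct. */
typedef struct {
    const char *name;
    void (*render_frame)(const uint8_t *pixels,
                         uint32_t width, uint32_t height,
                         Pixel_Format format);
} Renderer;

/* Stub backend: records the call instead of uploading and drawing. */
static uint32_t last_w, last_h;
static void gl_render_frame(const uint8_t *pixels,
                            uint32_t width, uint32_t height,
                            Pixel_Format format)
{
    (void)pixels; (void)format;
    last_w = width;             /* real code: upload textures + draw */
    last_h = height;
}

static const Renderer gl_renderer = { "opengl", gl_render_frame };
```

Runtime selection then becomes a table lookup by name; compile-time selection, an `#ifdef` choosing which backend struct gets linked in.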