Computer Use Remote: Full Desktop Control Pipeline
This release builds out the computer_use_remote tool into a complete, cross-platform host desktop control system with visual verification, multimodal capture handling, and platform-specific structural targeting.
Highlights
-
computer_use_remoteexposed as a callable tool — The model can now invokecomputer_use_remotedirectly in live sessions. Availability, trust mode, and re-arm enforcement remain runtime checks rather than prompt-loader gates. -
Visual verification required for all desktop actions — State-changing desktop actions are treated as unverified attempts until a fresh screenshot visibly confirms the outcome. Agents must stop and not proceed when a screenshot is unavailable.
-
Screenshots attached as multimodal tool results — Computer-use captures are returned as real multimodal vision messages (not just text summaries), so the model can visually inspect the screen after each action. Older capture payloads are pruned to prevent runaway context growth.
-
Host desktop cleanly separated from Xpra desktop —
computer_use_remoteis now the sole host desktop-control path;linux-desktoptargets only the internal Docker/Xpra environment. Host-screen queries rank ahead of the Xpra skill while explicit "Agent Zero Desktop" requests still route correctly. -
Platform-specific structural targeting skills:
- macOS — Dedicated skill for Accessibility (AX) structural targeting with
ax_snapshotandax_actionsupport, loaded only when the backend reports macOS capabilities. - Windows — UIA-based skill with window-management guidance, selector passthrough, and click-last workflow hints.
- Linux — AT-SPI/Wayland skill with compact structural tree outlines in snapshot responses for semantic target selection.
- Backend-specific action details are kept out of the generic prompt; the generic layer handles only backend discovery and skill loading.
- macOS — Dedicated skill for Accessibility (AX) structural targeting with
-
Codex OAuth proxy preserves vision inputs — Image content parts are now correctly converted to Responses API
input_imageparts instead of being flattened to text, with regression coverage for multimodal tool results containing screenshots. -
macOS approval denial handled gracefully —
COMPUTER_USE_APPROVAL_REQUIREDresponses map to the existing re-arm-required stop flow, preventing agents from retrying or falling back to screenshots when permissions haven't been granted. -
Prompt token accounting fixed for screenshots — Embedded base64 image data URLs are sanitized from token estimates so screenshot attachments no longer inflate context budgets.
-
Window-hide guidance updated — Ubuntu/GNOME/Wayland sessions now prefer
Super+HoverAlt+F9, with a reminder that keystroke results only prove the keys were sent, not that the action succeeded.