github agent0ai/agent-zero v1.17

18 hours ago

Computer Use Remote: Full Desktop Control Pipeline

This release builds out the computer_use_remote tool into a complete, cross-platform host desktop control system with visual verification, multimodal capture handling, and platform-specific structural targeting.

Highlights

  • computer_use_remote exposed as a callable tool — The model can now invoke computer_use_remote directly in live sessions. Availability, trust mode, and re-arm enforcement remain runtime checks rather than prompt-loader gates.

  • Visual verification required for all desktop actions — State-changing desktop actions are treated as unverified attempts until a fresh screenshot visibly confirms the outcome. Agents must stop and not proceed when a screenshot is unavailable.

  • Screenshots attached as multimodal tool results — Computer-use captures are returned as real multimodal vision messages (not just text summaries), so the model can visually inspect the screen after each action. Older capture payloads are pruned to prevent runaway context growth.

  • Host desktop cleanly separated from Xpra desktopcomputer_use_remote is now the sole host desktop-control path; linux-desktop targets only the internal Docker/Xpra environment. Host-screen queries rank ahead of the Xpra skill while explicit "Agent Zero Desktop" requests still route correctly.

  • Platform-specific structural targeting skills:

    • macOS — Dedicated skill for Accessibility (AX) structural targeting with ax_snapshot and ax_action support, loaded only when the backend reports macOS capabilities.
    • Windows — UIA-based skill with window-management guidance, selector passthrough, and click-last workflow hints.
    • Linux — AT-SPI/Wayland skill with compact structural tree outlines in snapshot responses for semantic target selection.
    • Backend-specific action details are kept out of the generic prompt; the generic layer handles only backend discovery and skill loading.
  • Codex OAuth proxy preserves vision inputs — Image content parts are now correctly converted to Responses API input_image parts instead of being flattened to text, with regression coverage for multimodal tool results containing screenshots.

  • macOS approval denial handled gracefullyCOMPUTER_USE_APPROVAL_REQUIRED responses map to the existing re-arm-required stop flow, preventing agents from retrying or falling back to screenshots when permissions haven't been granted.

  • Prompt token accounting fixed for screenshots — Embedded base64 image data URLs are sanitized from token estimates so screenshot attachments no longer inflate context budgets.

  • Window-hide guidance updated — Ubuntu/GNOME/Wayland sessions now prefer Super+H over Alt+F9, with a reminder that keystroke results only prove the keys were sent, not that the action succeeded.

Don't miss a new agent-zero release

NewReleases is sending notifications on new releases.