Overview of Gemini 2.5 Computer Use Model
Google DeepMind has released the Gemini 2.5 Computer Use model, a specialized model built on the capabilities of Gemini 2.5 Pro and aimed at developers building agents that interact with user interfaces (UIs). The model outperforms competing models on web and mobile control benchmarks while running at lower latency.
Model Features
The core capabilities of the Gemini 2.5 Computer Use model are exposed through the new computer_use tool in the Gemini API, which is designed to be called in a loop. The inputs to each call are the user request, a screenshot of the environment, and a history of recent actions. The input can also exclude specific functions from the full list of supported UI actions, or add custom functions to it.
// Illustrative pseudocode for one turn of the loop (not the actual API signature)
action = computer_use(user_request, screenshot, action_history);
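To make the input description more concrete, here is a minimal Python sketch of how one turn's input could be assembled, including the exclude and custom-function options. Every name in it (`TurnInput`, `effective_actions`, and the action names in `SUPPORTED_ACTIONS`) is a hypothetical placeholder chosen for illustration, not the real schema of the Gemini API.

```python
from dataclasses import dataclass, field

# Placeholder action names for illustration only; the real list of
# supported UI actions is defined by the Gemini API documentation.
SUPPORTED_ACTIONS = ("click", "type", "scroll", "navigate")


@dataclass
class TurnInput:
    """Everything sent to the model for one turn (hypothetical shape)."""
    user_request: str
    screenshot: bytes
    action_history: list = field(default_factory=list)
    excluded_actions: frozenset = frozenset()
    custom_actions: tuple = ()


def effective_actions(turn: TurnInput) -> list:
    """Supported actions minus exclusions, followed by custom functions."""
    allowed = [a for a in SUPPORTED_ACTIONS if a not in turn.excluded_actions]
    # Append custom functions, skipping any name that is already allowed.
    allowed += [a for a in turn.custom_actions if a not in allowed]
    return allowed


turn = TurnInput(
    user_request="Find the cheapest flight to Lisbon",
    screenshot=b"<png bytes>",
    excluded_actions=frozenset({"navigate"}),
    custom_actions=("open_app",),
)
print(effective_actions(turn))  # ['click', 'type', 'scroll', 'open_app']
```

Keeping the exclusions and custom functions on the per-turn input, rather than in global state, makes it easy to run agents with different action sets side by side.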
The model analyzes these inputs and generates a response, typically a function call representing one of the UI actions such as clicking or typing. The response may also include a request for end-user confirmation, which is required for certain actions like making a purchase. After executing the received action, a new screenshot of the GUI and the current URL are sent back to the Computer Use model as a function response, restarting the loop. This iterative process continues until the task is completed, an error occurs, or the interaction is terminated by a safety response or user decision.
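The loop described above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the Gemini SDK: `Action`, `Observation`, `run_agent_loop`, and the `propose`, `execute`, and `confirm` callbacks are all hypothetical names. `propose` stands in for the model call, and a stop triggered by a safety response is not modeled separately.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """A UI action proposed by the model (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)
    needs_confirmation: bool = False


@dataclass
class Observation:
    """State returned after an action: new screenshot plus current URL."""
    screenshot: bytes
    url: str


def run_agent_loop(
    request: str,
    first_obs: Observation,
    propose: Callable[[str, Observation, List[Action]], Optional[Action]],
    execute: Callable[[Action], Observation],
    confirm: Callable[[Action], bool],
    max_steps: int = 20,
) -> Tuple[str, List[Action]]:
    """Run the loop until completion, error, user refusal, or step limit."""
    history: List[Action] = []
    obs = first_obs
    for _ in range(max_steps):
        action = propose(request, obs, history)  # stand-in for the model call
        if action is None:                       # model signals the task is done
            return "completed", history
        if action.needs_confirmation and not confirm(action):
            return "stopped_by_user", history
        try:
            obs = execute(action)                # perform it, capture new state
        except RuntimeError:
            return "error", history
        history.append(action)
    return "step_limit", history


# Stubbed demo: a scripted "model" and a fake browser, no network access.
script = [Action("click", {"x": 10, "y": 20}), Action("type", {"text": "hello"})]


def scripted_model(request, obs, history):
    return script[len(history)] if len(history) < len(script) else None


def fake_execute(action):
    return Observation(screenshot=b"", url="https://example.com/" + action.name)


status, history = run_agent_loop(
    "Say hello", Observation(b"", "https://example.com/"),
    scripted_model, fake_execute, confirm=lambda a: True,
)
print(status, [a.name for a in history])  # completed ['click', 'type']
```

The `max_steps` cap is an addition of this sketch rather than something the article describes. It is there only so that a misbehaving stub cannot loop forever.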
Performance Evaluation
The Gemini 2.5 Computer Use model demonstrates strong performance across multiple web and mobile control benchmarks, delivering leading quality for browser control at the lowest latency. It also performed strongly in the evaluations that Browserbase ran on the Online-Mind2Web benchmark.
Safety Measures
To address the unique risks posed by AI agents controlling computers, safety features have been built directly into the Gemini 2.5 Computer Use model. Developers also get safety controls that prevent the model from automatically completing potentially high-risk or harmful actions.
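On the developer side, a control of this kind could look like the following Python sketch, which sorts each proposed action into allow, ask the user, or block. It is an illustrative policy only: `Verdict`, `review_action`, `BLOCKED_ACTIONS`, and `HIGH_RISK_KEYWORDS` are hypothetical names, and the keyword matching is a deliberately simple stand-in, not how the model's built-in safety features work.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"      # execute without asking
    CONFIRM = "confirm"  # pause and ask the end user first
    BLOCK = "block"      # never execute


# Hypothetical, developer-chosen policy values for illustration.
BLOCKED_ACTIONS = {"delete_account"}
HIGH_RISK_KEYWORDS = ("purchase", "pay", "checkout")


def review_action(name: str, description: str, model_flagged: bool = False) -> Verdict:
    """Decide what to do with a proposed action before executing it."""
    if name in BLOCKED_ACTIONS:
        return Verdict.BLOCK
    text = description.lower()
    # Ask the user if the model requested confirmation or the text looks risky.
    if model_flagged or any(word in text for word in HIGH_RISK_KEYWORDS):
        return Verdict.CONFIRM
    return Verdict.ALLOW


print(review_action("click", "Click the Pay now button"))  # Verdict.CONFIRM
print(review_action("type", "Type a search query"))        # Verdict.ALLOW
```

Running a check like this before every execution keeps the final decision on high-risk steps with the end user rather than the agent.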
Developer Guide
The Gemini 2.5 Computer Use model is now available in public preview via the Gemini API on Google AI Studio and Vertex AI. Developers can consult the documentation and resources there to start building their own agent loops.
Blogger's Review: The launch of the Gemini 2.5 Computer Use model marks a significant advance in how AI interacts with user interfaces. Its low latency and high accuracy suggest strong potential in real-world applications. With the built-in safety measures, developers can build intelligent agents with greater confidence, and broad adoption across applications looks likely.