Build a Local Vision Agent for Windows 11 using OmniParser V2 and OmniTool

Abhishek Kumar | Last Updated: 01 Mar, 2025 | 6 min read

Imagine AI that doesn’t just think but sees and acts, interacting with your Windows 11 interface like a pro. Microsoft’s OmniParser V2 and OmniTool are here to make that a reality, powering autonomous GUI agents that redefine task automation and user experience. This article dives into their capabilities, offering a hands-on guide to set up your local environment and unlock their potential. From streamlining workflows to tackling real-world challenges, let’s explore how these tools can transform the way you work and play. Ready to build your own vision agent? Let’s get started!

Learning Objectives

  • Understand the core functionalities of OmniParser V2 and OmniTool in AI-driven GUI automation.
  • Learn how to set up and configure OmniParser V2 and OmniTool for local use.
  • Explore the interaction between AI agents and graphical user interfaces using vision models.
  • Identify real-world applications of OmniParser V2 and OmniTool in automation and accessibility.
  • Recognize responsible AI considerations and risk mitigation strategies in deploying autonomous GUI agents.

What is Microsoft OmniParser V2?

OmniParser V2 is a sophisticated AI screen parser designed to extract detailed, structured data from graphical user interfaces. It operates through a two-step process:

  • Detection Module: Utilizes a finely tuned YOLOv8 model to identify interactive elements such as buttons, icons, and menus within screenshots.
  • Captioning Module: Employs the Florence-2 foundation model to generate descriptive labels for these elements, clarifying their functions within the interface.

This dual approach enables large language models (LLMs) to comprehend GUIs thoroughly, facilitating accurate interactions and task execution. Compared to its predecessor, OmniParser V2 boasts significant enhancements, including a 60% reduction in latency and improved accuracy, particularly for smaller elements.
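
To make this two-step process concrete, here is a minimal Python sketch of how detection and captioning fit together. It is an illustration only, not the repository's actual API: the weight paths assume the download step described later in this article, and OmniParser's own loading utilities may differ.

# Illustrative sketch of OmniParser V2's two-stage pipeline (not the repo's actual API).
from ultralytics import YOLO
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

# Stage 1: detect interactive elements (buttons, icons, menus) with the fine-tuned YOLO model.
detector = YOLO("weights/icon_detect/model.pt")
screenshot = Image.open("screenshot.png")
detections = detector(screenshot)[0]  # bounding boxes for candidate UI elements

# Stage 2: caption each detected element with the Florence-2 foundation model.
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
captioner = AutoModelForCausalLM.from_pretrained(
    "weights/icon_caption_florence", trust_remote_code=True
)

elements = []
for box in detections.boxes.xyxy.tolist():
    crop = screenshot.crop(tuple(map(int, box)))
    inputs = processor(text="<CAPTION>", images=crop, return_tensors="pt")
    ids = captioner.generate(**inputs, max_new_tokens=20)
    caption = processor.batch_decode(ids, skip_special_tokens=True)[0]
    elements.append({"bbox": box, "caption": caption})

print(elements)  # structured, labeled elements an LLM can reason over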

What is OmniTool?

OmniTool is a dockerized Windows system that integrates OmniParser V2 with leading LLMs such as OpenAI, DeepSeek, Qwen, and Anthropic. This integration enables fully autonomous agentic actions by AI agents, allowing them to perform tasks independently and streamline repetitive GUI interactions. OmniTool provides a sandbox environment for testing and deploying agents, ensuring safety and efficiency in real-world applications.
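
Under the hood, this kind of agent runs a perceive-parse-plan-act loop: capture a screenshot of the VM, have OmniParser turn it into labeled elements, ask the LLM to choose the next action, and execute that action inside the VM. The sketch below illustrates the loop; capture_vm_screen, parse_screen, and execute_action are hypothetical placeholders standing in for OmniTool's real internals, and the JSON action format is invented for illustration.

# Conceptual perceive-parse-plan-act loop behind an OmniTool-style GUI agent.
import json
from openai import OpenAI  # any supported provider (OpenAI, DeepSeek, Qwen, Anthropic) would work

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def capture_vm_screen():
    """Placeholder: OmniTool grabs a screenshot of the Windows 11 VM."""
    raise NotImplementedError

def parse_screen(screenshot):
    """Placeholder: OmniParser V2 returns labeled, structured UI elements."""
    raise NotImplementedError

def execute_action(action):
    """Placeholder: OmniTool sends mouse/keyboard input into the VM."""
    raise NotImplementedError

def run_agent(task, max_steps=10):
    for _ in range(max_steps):
        screenshot = capture_vm_screen()        # 1. perceive the current screen
        elements = parse_screen(screenshot)     # 2. parse it into labeled elements
        response = client.chat.completions.create(  # 3. let the LLM plan the next action
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "You control a Windows 11 GUI. Given a task and labeled screen "
                    'elements, reply with JSON: {"action": "click|type|done", '
                    '"target": <element id>, "text": "..."}')},
                {"role": "user", "content": f"Task: {task}\nElements: {json.dumps(elements)}"},
            ],
        )
        action = json.loads(response.choices[0].message.content)
        if action["action"] == "done":          # 4. stop once the LLM declares success
            break
        execute_action(action)                  # 5. act inside the VM, then loop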

Introduction to OmniTool (Source: Author)

Setting Up OmniParser V2

To leverage the full potential of OmniParser V2, follow these steps to set up your local environment:

Prerequisites

  • Ensure you have Python installed on your system.
  • Install Conda (Miniconda or Anaconda) so you can create an isolated environment for the dependencies.

Installation

Clone the OmniParser V2 repository from GitHub:

git clone https://github.com/microsoft/OmniParser
cd OmniParser

Create and activate your Conda environment, then install the required packages:

conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Download the V2 weights (icon_detect and icon_caption_florence) using huggingface-cli, then rename the caption folder:

rm -rf weights/icon_detect weights/icon_caption weights/icon_caption_florence
huggingface-cli download microsoft/OmniParser-v2.0 --local-dir weights
mv weights/icon_caption weights/icon_caption_florence

After these steps, the weights directory should contain the icon_detect and icon_caption_florence folders.

Testing

Start the OmniParser V2 server and test its functionality using sample screenshots.

python gradio_demo.py
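
If you need a sample screenshot to test with, one quick way to capture your current desktop on Windows is with Pillow's ImageGrab (the file name below is just an example):

# Capture the current desktop to a PNG you can upload to the Gradio demo.
from PIL import ImageGrab

ImageGrab.grab().save("sample_screenshot.png")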

You can read this article for a more detailed guide to setting up OmniParser V2 on your machine.


Setting Up OmniTool

To leverage the full potential of OmniTool, follow these steps to set up your local environment:

Prerequisites

  • Ensure you have at least 30 GB of free disk space (5 GB for the ISO, 400 MB for the Docker container, and 20 GB for the storage folder).
  • Install Docker Desktop on your system (https://docs.docker.com/desktop/).
  • Download the Windows 11 Enterprise Evaluation ISO from the Microsoft Evaluation Center. Rename the file to custom.iso and copy it to the directory OmniParser/omnitool/omnibox/vm/win11iso.

VM Setup

Navigate to the VM management script directory:

cd OmniParser/omnitool/omnibox/scripts

Build the Docker container (400 MB) and install the ISO to a storage folder (20 GB) with ./manage_vm.sh create. This will take 20-90 minutes depending on your download speed (commonly around 60 minutes). When complete, the terminal will show "VM + server is up and running!". You can watch the apps being installed in the VM by viewing the desktop in the NoVNC viewer (http://localhost:8006/vnc.html?view_only=1&autoconnect=1&resize=scale). A terminal window is open on the VM's desktop during setup and closes once setup is done; if you can still see it, wait and don't click around!


After the first creation, a snapshot of the VM state is stored in vm/win11storage. You can then manage the VM with ./manage_vm.sh start and ./manage_vm.sh stop. To delete the VM, run ./manage_vm.sh delete and remove the OmniParser/omnitool/omnibox/vm/win11storage directory.

Running OmniTool in Gradio

  • Change into the gradio directory by running: cd OmniParser/omnitool/gradio
  • Activate your conda environment with: conda activate omni
  • Launch the server using: python app.py --windows_host_url localhost:8006 --omniparser_server_url localhost:8000
  • Open the URL displayed in your terminal, enter your API key, and begin interacting with the AI agent.
  • Ensure that the OmniParser server, the OmniTool VM, and the Gradio interface are each running in separate terminal windows. (The OmniParser server that OmniTool talks to on port 8000 is started from the OmniParser/omnitool/omniparserserver directory with python -m omniparserserver, per the repository's instructions.)


Interacting with the Agent

Once your environment is set up, you can use the Gradio UI to provide commands to the agent. This interface allows you to observe the agent’s reasoning and execution within the OmniBox VM. Example use cases include:

  • Opening Applications: Use the agent to launch applications by recognizing icons or menu items.
  • Navigating Menus: Automate menu navigation by identifying and interacting with specific UI elements.
  • Performing Searches: Leverage the agent to perform searches within applications or web browsers.
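
For example, you might give the agent a hypothetical task such as "Open the Edge browser and search for today's weather": the agent screenshots the VM, OmniParser labels the visible elements, and the LLM decides which element to click or what text to type at each step.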

Vision Models Supported by OmniTool

OmniTool supports a variety of state-of-the-art vision models out of the box, including:

  • OpenAI (4o/o1/o3-mini): Known for its versatility and performance in understanding complex UI elements.
  • DeepSeek (R1): Offers robust capabilities for recognizing and interacting with GUI components.
  • Qwen (2.5VL): Provides advanced features for detailed UI analysis and automation.
  • Anthropic (Sonnet): Enhances agent capabilities with sophisticated language understanding and generation.

Responsible AI Considerations and Risks

To align with Microsoft’s AI principles and Responsible AI practices, OmniParser V2 and OmniTool incorporate several risk mitigation strategies:

  • Training Data: The icon caption model is trained with Responsible AI data to avoid inferring sensitive attributes from icon images.
  • Threat Model Analysis: Conducted using the Microsoft Threat Modeling Tool to identify and address potential risks.
  • User Guidance: Users are advised to apply OmniParser only to screenshots that do not contain harmful or violent content.
  • Human Oversight: Encouraging human oversight to minimize risks associated with autonomous agents.

Real-World Applications

The capabilities of OmniParser V2 and OmniTool enable a wide range of applications:

  • UI Automation: Automating interactions with graphical user interfaces to streamline workflows.
  • Accessibility Solutions: Providing structured data for assistive technologies to enhance user experiences.
  • User Interface Analysis: Evaluating and improving user interface designs based on extracted structured data.

Conclusion

OmniParser V2 and OmniTool represent a significant advancement in AI visual parsing and GUI automation. By integrating these tools, developers can create sophisticated AI agents that interact seamlessly with graphical user interfaces, unlocking new possibilities for automation and accessibility. As AI technology continues to evolve, the potential applications of OmniParser V2 and OmniTool will only grow, shaping the future of how we interact with digital interfaces.

Key Takeaways

  • OmniParser V2 enhances AI-driven GUI automation by accurately parsing and labeling interface elements.
  • OmniTool integrates OmniParser V2 with leading LLMs to enable fully autonomous agentic actions.
  • Setting up OmniParser V2 and OmniTool requires configuring dependencies, Docker, and a virtualized Windows environment.
  • Real-world applications include UI automation, accessibility solutions, and user interface analysis.
  • Responsible AI practices ensure ethical deployment by addressing risks through training data, oversight, and threat modeling.

Frequently Asked Questions

Q1. What is OmniParser V2?

A. OmniParser V2 is an AI-powered tool that extracts structured data from graphical user interfaces using detection and captioning models.

Q2. How does OmniTool enhance AI-driven GUI automation?

A. OmniTool integrates OmniParser V2 with LLMs to enable AI agents to autonomously interact with GUI elements.

Q3. What are the prerequisites for setting up OmniParser V2?

A. You need Python, Conda, and the necessary dependencies installed, along with OmniParser’s model weights.

Q4. How does OmniTool utilize a virtualized Windows environment?

A. OmniTool runs within a Dockerized Windows VM, allowing AI agents to interact safely with GUI applications.

Q5. What are some real-world applications of OmniParser V2 and OmniTool?

A. They are used for UI automation, accessibility solutions, and improving user interface design.

Hello, I'm Abhishek, a Data Engineer Trainee at Analytics Vidhya. I'm passionate about data engineering and video games. I have experience with Apache Hadoop, AWS, and SQL, and I keep exploring their intricacies and optimizing data workflows.
