Imagine AI that doesn’t just think but sees and acts, interacting with your Windows 11 interface like a pro. Microsoft’s OmniParser V2 and OmniTool are here to make that a reality, powering autonomous GUI agents that redefine task automation and user experience. This article dives into their capabilities, offering a hands-on guide to set up your local environment and unlock their potential. From streamlining workflows to tackling real-world challenges, let’s explore how these tools can transform the way you work and play. Ready to build your own vision agent? Let’s get started!
OmniParser V2 is a sophisticated AI screen parser designed to extract detailed, structured data from graphical user interfaces. It operates through a two-step process: an icon detection model first locates interactable elements (buttons, icons, input fields) on the screen, and a captioning model (the Florence-2-based icon_caption_florence weights you download later in this guide) then describes the function of each detected element.
This dual approach enables large language models (LLMs) to comprehend GUIs thoroughly, facilitating accurate interactions and task execution. Compared to its predecessor, OmniParser V2 boasts significant enhancements, including a 60% reduction in latency and improved accuracy, particularly for smaller elements.
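To make that two-step flow concrete, here is a minimal conceptual sketch in Python. The helper functions detect_elements and caption_element are hypothetical placeholders for the detection and captioning models that ship with the repository (the icon_detect and icon_caption_florence weights downloaded below); the real entry points live in the repo's utility modules and differ in detail.

```python
from PIL import Image

def detect_elements(image: Image.Image) -> list[tuple[int, int, int, int]]:
    """Placeholder for the icon-detection model: returns bounding boxes of interactable elements."""
    return [(10, 10, 110, 60)]  # dummy box purely for illustration

def caption_element(crop: Image.Image) -> str:
    """Placeholder for the Florence-2-based captioning model: describes what an element does."""
    return "example: a clickable Settings button"

def parse_screen(screenshot_path: str) -> list[dict]:
    """Two-step parse: detect interactable elements, then caption each one."""
    image = Image.open(screenshot_path)
    boxes = detect_elements(image)  # step 1: where can the agent interact?
    return [
        {"bbox": box, "caption": caption_element(image.crop(box))}  # step 2: what is each element?
        for box in boxes
    ]
```

The structured list that parse_screen returns is exactly the kind of representation an LLM consumes to decide where to click or type.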
OmniTool is a dockerized Windows system that integrates OmniParser V2 with leading LLMs such as OpenAI, DeepSeek, Qwen, and Anthropic. This integration enables fully autonomous agentic actions by AI agents, allowing them to perform tasks independently and streamline repetitive GUI interactions. OmniTool provides a sandbox environment for testing and deploying agents, ensuring safety and efficiency in real-world applications.
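Conceptually, OmniTool drives a perceive-parse-reason-act loop around OmniParser V2 and the chosen LLM. The sketch below is for intuition only: every function in it is a hypothetical stand-in, and the real implementation lives in the repo's omnitool/ directory.

```python
# Hypothetical stand-ins for the components OmniTool wires together.

def take_screenshot() -> bytes:
    """Placeholder: capture the OmniBox VM's screen."""
    return b""  # dummy image bytes

def parse_with_omniparser(screenshot: bytes) -> list[dict]:
    """Placeholder: OmniParser V2 turns the screenshot into structured elements."""
    return [{"bbox": (10, 10, 110, 60), "caption": "Start menu button"}]

def ask_llm_for_action(task: str, elements: list[dict]) -> dict:
    """Placeholder: the chosen LLM (OpenAI, DeepSeek, Qwen, or Anthropic) picks the next step."""
    return {"type": "click", "target": elements[0]} if elements else {"type": "stop"}

def execute_action(action: dict) -> None:
    """Placeholder: send the click or keystrokes into the sandboxed VM."""
    print(f"executing: {action}")

def run_agent(task: str, max_steps: int = 10) -> None:
    """Simplified perceive-parse-reason-act loop of an OmniTool-style GUI agent."""
    for _ in range(max_steps):
        screenshot = take_screenshot()                # perceive the VM's screen
        elements = parse_with_omniparser(screenshot)  # parse pixels into structured elements
        action = ask_llm_for_action(task, elements)   # the LLM decides what to do next
        if action["type"] == "stop":
            break
        execute_action(action)                        # act inside the VM

run_agent("Open Notepad and type hello")
```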
To leverage the full potential of OmniParser V2, follow these steps to set up your local environment:
Clone the OmniParser V2 repository from GitHub and move into the project directory:
git clone https://github.com/microsoft/OmniParser.git
cd OmniParser
Create and activate a Conda environment, then install the required packages.
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt
Next, download the OmniParser V2 model weights from Hugging Face and rename the caption model folder:
rm -rf weights/icon_detect weights/icon_caption weights/icon_caption_florence
huggingface-cli download microsoft/OmniParser-v2.0 --local-dir weights
mv weights/icon_caption weights/icon_caption_florence
Start the OmniParser V2 server and test its functionality using sample screenshots.
python gradio_demo.py
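Once the demo is running, you can also poke at it programmatically rather than only through the browser. Below is a minimal sketch using the gradio_client package; the URL, the endpoint name, and the argument list are assumptions, so call view_api() first and adapt the predict() call to whatever gradio_demo.py actually exposes on your machine.

```python
from gradio_client import Client, handle_file

# Point this at the local URL that gradio_demo.py prints on startup
# (http://127.0.0.1:7860 is Gradio's default; your port may differ).
client = Client("http://127.0.0.1:7860/")

# Inspect the demo's API first: the endpoint name and parameters below are
# assumptions, so confirm them against this output before calling predict().
client.view_api()

# Example call with an assumed single-image endpoint; adjust the arguments
# and api_name to match what view_api() reported.
result = client.predict(
    handle_file("sample_screenshot.png"),
    api_name="/process",
)
print(result)
```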
For a more detailed walkthrough of setting up OmniParser V2 on your machine, you can refer to this article.
To leverage the full potential of OmniTool, follow these steps to set up your local environment:
Navigate to the VM management script directory:
cd OmniParser/omnitool/omnibox/scripts
Build the Docker container (about 400 MB) and install the Windows ISO to a storage folder (about 20 GB) with ./manage_vm.sh create. The process takes 20-90 minutes depending on download speeds (commonly around 60 minutes), and the terminal will show "VM + server is up and running!" when it completes. You can watch the apps being installed in the VM through the NoVNC viewer at http://localhost:8006/vnc.html?view_only=1&autoconnect=1&resize=scale. The terminal window shown in the NoVNC viewer will no longer be open on the desktop once setup is done; if you can still see it, wait and don't click around!
After the first creation, a snapshot of the VM state is stored in vm/win11storage. You can then manage the VM with ./manage_vm.sh start and ./manage_vm.sh stop. To delete the VM, run ./manage_vm.sh delete and remove the OmniParser/omnitool/omnibox/vm/win11storage directory.
Output:
Once your environment is set up, you can use the Gradio UI to provide commands to the agent and observe its reasoning and execution within the OmniBox VM, for example as it opens an application, changes a setting, or works through a repetitive GUI chore.
OmniTool supports a variety of state-of-the-art vision-capable models out of the box, including OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), and Anthropic (Sonnet).
To align with Microsoft’s AI principles and Responsible AI practices, OmniParser V2 and OmniTool incorporate several risk mitigation strategies.
The capabilities of OmniParser V2 and OmniTool enable a wide range of applications, from UI automation and accessibility solutions to improving user interface design.
OmniParser V2 and OmniTool represent a significant advancement in AI visual parsing and GUI automation. By integrating these tools, developers can create sophisticated AI agents that interact seamlessly with graphical user interfaces, unlocking new possibilities for automation and accessibility. As AI technology continues to evolve, the potential applications of OmniParser V2 and OmniTool will only grow, shaping the future of how we interact with digital interfaces.
Q1. What is OmniParser V2?
A. OmniParser V2 is an AI-powered tool that extracts structured data from graphical user interfaces using detection and captioning models.

Q2. What does OmniTool do?
A. OmniTool integrates OmniParser V2 with LLMs to enable AI agents to autonomously interact with GUI elements.

Q3. What do I need to run OmniParser V2 locally?
A. You need Python, Conda, and the necessary dependencies installed, along with OmniParser’s model weights.

Q4. How does OmniTool keep agent actions safe?
A. OmniTool runs within a Dockerized Windows VM, allowing AI agents to interact safely with GUI applications.

Q5. What are the main applications of these tools?
A. They are used for UI automation, accessibility solutions, and improving user interface design.