Build a Local Vision Agent for Windows 11 using OmniParser V2 and OmniTool

Abhishek Kumar | Last Updated: 01 Mar, 2025 | 6 min read

Imagine AI that doesn’t just think but sees and acts, interacting with your Windows 11 interface like a pro. Microsoft’s OmniParser V2 and OmniTool are here to make that a reality, powering autonomous GUI agents that redefine task automation and user experience. This article dives into their capabilities, offering a hands-on guide to set up your local environment and unlock their potential. From streamlining workflows to tackling real-world challenges, let’s explore how these tools can transform the way you work and play. Ready to build your own vision agent? Let’s get started!

Learning Objectives

  • Understand the core functionalities of OmniParser V2 and OmniTool in AI-driven GUI automation.
  • Learn how to set up and configure OmniParser V2 and OmniTool for local use.
  • Explore the interaction between AI agents and graphical user interfaces using vision models.
  • Identify real-world applications of OmniParser V2 and OmniTool in automation and accessibility.
  • Recognize responsible AI considerations and risk mitigation strategies in deploying autonomous GUI agents.

What is Microsoft OmniParser V2?

OmniParser V2 is a sophisticated AI screen parser designed to extract detailed, structured data from graphical user interfaces. It operates through a two-step process:

  • Detection Module: Utilizes a finely tuned YOLOv8 model to identify interactive elements such as buttons, icons, and menus within screenshots.
  • Captioning Module: Employs the Florence-2 foundation model to generate descriptive labels for these elements, clarifying their functions within the interface.

This dual approach enables large language models (LLMs) to comprehend GUIs thoroughly, facilitating accurate interactions and task execution. Compared to its predecessor, OmniParser V2 boasts significant enhancements, including a 60% reduction in latency and improved accuracy, particularly for smaller elements.
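
To make this two-step process concrete, here is a minimal Python sketch of how detection and captioning fit together. It is an illustration only, not the repository's actual API: the weight paths assume the download step described later in this article, and OmniParser's own loading utilities may differ.

# Illustrative sketch of OmniParser V2's two-stage pipeline (not the repo's actual API).
from ultralytics import YOLO
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

# Stage 1: detect interactive elements (buttons, icons, menus) with the fine-tuned YOLO model.
detector = YOLO("weights/icon_detect/model.pt")
screenshot = Image.open("screenshot.png")
detections = detector(screenshot)[0]  # bounding boxes for candidate UI elements

# Stage 2: caption each detected element with the Florence-2 foundation model.
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
captioner = AutoModelForCausalLM.from_pretrained(
    "weights/icon_caption_florence", trust_remote_code=True
)

elements = []
for box in detections.boxes.xyxy.tolist():
    crop = screenshot.crop(tuple(map(int, box)))
    inputs = processor(text="<CAPTION>", images=crop, return_tensors="pt")
    ids = captioner.generate(**inputs, max_new_tokens=20)
    caption = processor.batch_decode(ids, skip_special_tokens=True)[0]
    elements.append({"bbox": box, "caption": caption})

print(elements)  # structured, labeled elements an LLM can reason over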

What is OmniTool?

OmniTool is a dockerized Windows system that integrates OmniParser V2 with leading LLMs such as OpenAI, DeepSeek, Qwen, and Anthropic. This integration enables fully autonomous agentic actions by AI agents, allowing them to perform tasks independently and streamline repetitive GUI interactions. OmniTool provides a sandbox environment for testing and deploying agents, ensuring safety and efficiency in real-world applications.
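
Under the hood, this kind of agent runs a perceive-parse-plan-act loop: capture a screenshot of the VM, have OmniParser turn it into labeled elements, ask the LLM to choose the next action, and execute that action inside the VM. The sketch below illustrates the loop; capture_vm_screen, parse_screen, and execute_action are hypothetical placeholders standing in for OmniTool's real internals, and the JSON action format is invented for illustration.

# Conceptual perceive-parse-plan-act loop behind an OmniTool-style GUI agent.
import json
from openai import OpenAI  # any supported provider (OpenAI, DeepSeek, Qwen, Anthropic) would work

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def capture_vm_screen():
    """Placeholder: OmniTool grabs a screenshot of the Windows 11 VM."""
    raise NotImplementedError

def parse_screen(screenshot):
    """Placeholder: OmniParser V2 returns labeled, structured UI elements."""
    raise NotImplementedError

def execute_action(action):
    """Placeholder: OmniTool sends mouse/keyboard input into the VM."""
    raise NotImplementedError

def run_agent(task, max_steps=10):
    for _ in range(max_steps):
        screenshot = capture_vm_screen()        # 1. perceive the current screen
        elements = parse_screen(screenshot)     # 2. parse it into labeled elements
        response = client.chat.completions.create(  # 3. let the LLM plan the next action
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "You control a Windows 11 GUI. Given a task and labeled screen "
                    'elements, reply with JSON: {"action": "click|type|done", '
                    '"target": <element id>, "text": "..."}')},
                {"role": "user", "content": f"Task: {task}\nElements: {json.dumps(elements)}"},
            ],
        )
        action = json.loads(response.choices[0].message.content)
        if action["action"] == "done":          # 4. stop once the LLM declares success
            break
        execute_action(action)                  # 5. act inside the VM, then loop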

Introduction to OmniTool (Source: Author)

Setting Up OmniParser V2

To leverage the full potential of OmniParser V2, follow these steps to set up your local environment:

Prerequisites

  • Ensure you have Python installed on your system.
  • Install Conda (Miniconda or Anaconda) so you can create an isolated environment for the dependencies.

Installation

Clone the OmniParser V2 repository from GitHub:

git clone https://github.com/microsoft/OmniParser
cd OmniParser

Create and activate your Conda environment, then install the required packages:

conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Download the V2 weights (icon_detect and icon_caption_florence) using huggingface-cli, then rename the caption folder:

rm -rf weights/icon_detect weights/icon_caption weights/icon_caption_florence
huggingface-cli download microsoft/OmniParser-v2.0 --local-dir weights
mv weights/icon_caption weights/icon_caption_florence

After these steps, the weights directory should contain the icon_detect and icon_caption_florence folders.

Testing

Start the OmniParser V2 server and test its functionality using sample screenshots.

python gradio_demo.py
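
If you need a sample screenshot to test with, one quick way to capture your current desktop on Windows is with Pillow's ImageGrab (the file name below is just an example):

# Capture the current desktop to a PNG you can upload to the Gradio demo.
from PIL import ImageGrab

ImageGrab.grab().save("sample_screenshot.png")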

You can read this article for a more detailed guide to setting up OmniParser V2 on your machine.


Setting Up OmniTool

To leverage the full potential of OmniTool, follow these steps to set up your local environment:

Prerequisites

  • Ensure you have at least 30 GB of free disk space (5 GB for the ISO, 400 MB for the Docker container, and 20 GB for the storage folder).
  • Install Docker Desktop on your system (https://docs.docker.com/desktop/).
  • Download the Windows 11 Enterprise Evaluation ISO from the Microsoft Evaluation Center. Rename the file to custom.iso and copy it to the directory OmniParser/omnitool/omnibox/vm/win11iso.

VM Setup

Navigate to the VM management script directory:

cd OmniParser/omnitool/omnibox/scripts

Build the Docker container (400 MB) and install the ISO to a storage folder (20 GB) with ./manage_vm.sh create. This will take 20-90 minutes depending on your download speed (commonly around 60 minutes). When complete, the terminal will show "VM + server is up and running!". You can watch the apps being installed in the VM by viewing the desktop in the NoVNC viewer (http://localhost:8006/vnc.html?view_only=1&autoconnect=1&resize=scale). A terminal window is open on the VM's desktop during setup and closes once setup is done; if you can still see it, wait and don't click around!


After the first creation, a snapshot of the VM state is stored in vm/win11storage. You can then manage the VM with ./manage_vm.sh start and ./manage_vm.sh stop. To delete the VM, run ./manage_vm.sh delete and remove the OmniParser/omnitool/omnibox/vm/win11storage directory.

Running OmniTool in Gradio

  • Change into the gradio directory by running: cd OmniParser/omnitool/gradio
  • Activate your conda environment with: conda activate omni
  • Launch the server using: python app.py --windows_host_url localhost:8006 --omniparser_server_url localhost:8000
  • Open the URL displayed in your terminal, enter your API key, and begin interacting with the AI agent.
  • Ensure that the OmniParser server, the OmniTool VM, and the Gradio interface are each running in separate terminal windows. (The OmniParser server that OmniTool talks to on port 8000 is started from the OmniParser/omnitool/omniparserserver directory with python -m omniparserserver, per the repository's instructions.)


Interacting with the Agent

Once your environment is set up, you can use the Gradio UI to provide commands to the agent. This interface allows you to observe the agent’s reasoning and execution within the OmniBox VM. Example use cases include:

  • Opening Applications: Use the agent to launch applications by recognizing icons or menu items.
  • Navigating Menus: Automate menu navigation by identifying and interacting with specific UI elements.
  • Performing Searches: Leverage the agent to perform searches within applications or web browsers.
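
For example, you might give the agent a hypothetical task such as "Open the Edge browser and search for today's weather": the agent screenshots the VM, OmniParser labels the visible elements, and the LLM decides which element to click or what text to type at each step.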

Vision Models Supported by OmniTool

OmniTool supports a variety of state-of-the-art vision models out of the box, including:

  • OpenAI (4o/o1/o3-mini): Known for its versatility and performance in understanding complex UI elements.
  • DeepSeek (R1): Offers robust capabilities for recognizing and interacting with GUI components.
  • Qwen (2.5VL): Provides advanced features for detailed UI analysis and automation.
  • Anthropic (Sonnet): Enhances agent capabilities with sophisticated language understanding and generation.

Responsible AI Considerations and Risks

To align with Microsoft’s AI principles and Responsible AI practices, OmniParser V2 and OmniTool incorporate several risk mitigation strategies:

  • Training Data: The icon caption model is trained with Responsible AI data to avoid inferring sensitive attributes from icon images.
  • Threat Model Analysis: Conducted using the Microsoft Threat Modeling Tool to identify and address potential risks.
  • User Guidance: Users are advised to apply OmniParser only to screenshots that do not contain harmful or violent content.
  • Human Oversight: Encouraging human oversight to minimize risks associated with autonomous agents.

Real-World Applications

The capabilities of OmniParser V2 and OmniTool enable a wide range of applications:

  • UI Automation: Automating interactions with graphical user interfaces to streamline workflows.
  • Accessibility Solutions: Providing structured data for assistive technologies to enhance user experiences.
  • User Interface Analysis: Evaluating and improving user interface designs based on extracted structured data.

Conclusion

OmniParser V2 and OmniTool represent a significant advancement in AI visual parsing and GUI automation. By integrating these tools, developers can create sophisticated AI agents that interact seamlessly with graphical user interfaces, unlocking new possibilities for automation and accessibility. As AI technology continues to evolve, the potential applications of OmniParser V2 and OmniTool will only grow, shaping the future of how we interact with digital interfaces.

Key Takeaways

  • OmniParser V2 enhances AI-driven GUI automation by accurately parsing and labeling interface elements.
  • OmniTool integrates OmniParser V2 with leading LLMs to enable fully autonomous agentic actions.
  • Setting up OmniParser V2 and OmniTool requires configuring dependencies, Docker, and a virtualized Windows environment.
  • Real-world applications include UI automation, accessibility solutions, and user interface analysis.
  • Responsible AI practices ensure ethical deployment by addressing risks through training data, oversight, and threat modeling.

Frequently Asked Questions

Q1. What is OmniParser V2?

A. OmniParser V2 is an AI-powered tool that extracts structured data from graphical user interfaces using detection and captioning models.

Q2. How does OmniTool enhance AI-driven GUI automation?

A. OmniTool integrates OmniParser V2 with LLMs to enable AI agents to autonomously interact with GUI elements.

Q3. What are the prerequisites for setting up OmniParser V2?

A. You need Python, Conda, and the necessary dependencies installed, along with OmniParser’s model weights.

Q4. How does OmniTool utilize a virtualized Windows environment?

A. OmniTool runs within a Dockerized Windows VM, allowing AI agents to interact safely with GUI applications.

Q5. What are some real-world applications of OmniParser V2 and OmniTool?

A. They are used for UI automation, accessibility solutions, and improving user interface design.

Hello, I'm Abhishek, a Data Engineer Trainee at Analytics Vidhya. I'm passionate about data engineering and video games. I have experience with Apache Hadoop, AWS, and SQL, and I keep exploring their intricacies and optimizing data workflows.
