The Qwen family of vision-language models continues to evolve, with the release of Qwen2.5-VL marking a significant leap forward. Building on the success of Qwen2-VL, which was launched five months ago, Qwen2.5-VL benefits from valuable feedback and contributions from the developer community. This feedback has played a key role in refining the model, adding new features, and optimizing its capabilities. In this article, we will explore the architecture of Qwen2.5-VL, along with its features and capabilities.
Alibaba Cloud’s Qwen model has received a vision upgrade with the new Qwen2.5-VL, which is designed to offer cutting-edge vision capabilities for complex real-life tasks. Here’s a look at what this new model’s advanced features can do.
The model’s architecture introduces two key innovations:
1. Dynamic Resolution and Frame Rate Training: Images are processed at their native resolutions and videos are sampled at dynamic frame rates, with the temporal component of mRoPE aligned to absolute time so the model can localize when events happen in long videos.
2. Streamlined Vision Encoder: It enhances the Vision Transformer (ViT) by improving attention mechanisms and activation functions. This facilitates faster and more efficient training and inference, making it work seamlessly with Qwen2.5’s language model (see the sketch after this list).
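To make the second point concrete, the Qwen2.5-VL release notes describe streamlining the ViT with window attention, RMSNorm, and SwiGLU, so the encoder shares the design language of Qwen2.5’s LLM stack. The minimal PyTorch sketch below illustrates that block shape; the dimensions, class names, and the use of plain (non-windowed) attention are illustrative assumptions, not the released model’s exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    # SiLU-gated MLP, the activation style the streamlined ViT shares with Qwen2.5's LLM
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim)
        self.up = nn.Linear(dim, hidden_dim)
        self.down = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class EncoderBlock(nn.Module):
    # Illustrative pre-norm ViT block: RMSNorm -> attention -> RMSNorm -> SwiGLU MLP.
    # The real encoder uses window attention in most layers; plain attention is used
    # here only to keep the sketch short.
    def __init__(self, dim=1280, heads=16, mlp_dim=3456):
        super().__init__()
        self.norm1 = nn.RMSNorm(dim)  # requires PyTorch >= 2.4
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.RMSNorm(dim)
        self.mlp = SwiGLU(dim, mlp_dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

patches = torch.randn(1, 256, 1280)   # a batch of flattened image-patch embeddings
print(EncoderBlock()(patches).shape)  # torch.Size([1, 256, 1280])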
Now let’s try out some prompts and test the capabilities of Qwen2.5-VL.
Qwen2.5-VL can identify an expanded range of categories, including flora, fauna, global landmarks, film/TV IPs, and commercial products.
Let’s test it with an example.
Prompt: “What are these attractions? Please give their names in Hindi and English.”
Response by Qwen2.5-VL-72B-Instruct:
The attractions in the images are:
The model employs bounding boxes and point coordinates for hierarchical object localization, outputting standardized JSON for spatial reasoning.
Prompt: “Locate every cake and describe their features, output the box coordinates in JSON format.”
Response by Qwen2.5-VL:
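To show how this kind of output is consumed downstream, here is a minimal parsing sketch. The JSON below is a made-up placeholder in the shape such a prompt returns; the bbox_2d and label field names are assumptions based on common Qwen2.5-VL grounding examples, not the model’s actual reply to the prompt above.

import json

# Hypothetical grounding output in the JSON style the prompt above asks for;
# the coordinates and labels are illustrative, not a real model response.
raw = """[
  {"bbox_2d": [135, 114, 1016, 672], "label": "chocolate cake topped with strawberries"},
  {"bbox_2d": [1100, 210, 1380, 560], "label": "vanilla sponge cake with cream frosting"}
]"""

for obj in json.loads(raw):
    x1, y1, x2, y2 = obj["bbox_2d"]
    print(f'{obj["label"]}: top-left=({x1}, {y1}), bottom-right=({x2}, {y2})')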
Enhanced OCR capabilities support multilingual, multi-orientation text extraction, critical for financial audits and compliance workflows.
Prompt: “Spotting all the text in the image with line-level, and output in JSON format.”
Response by Qwen2.5-VL:
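The line-level OCR result likewise comes back as JSON that can be post-processed. The sketch below uses a made-up placeholder response and assumes each entry carries a box plus its text (the bbox_2d and text_content field names are assumptions); it sorts the detected lines top to bottom to recover reading order.

import json

# Hypothetical line-level OCR output; field names and values are illustrative.
raw = """[
  {"bbox_2d": [40, 210, 480, 250], "text_content": "Invoice No: 2024-0117"},
  {"bbox_2d": [40, 60, 360, 110], "text_content": "ACME Pvt. Ltd."},
  {"bbox_2d": [40, 300, 520, 345], "text_content": "Total: $1,249.00"}
]"""

# Sort by the top y-coordinate so the lines print in reading order
for line in sorted(json.loads(raw), key=lambda item: item["bbox_2d"][1]):
    print(line["text_content"])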
A proprietary QwenVL HTML format extracts layout data (headings, paragraphs, images) from magazines, research papers, and mobile screenshots.
Prompt: “Structure this technical report into HTML with bounding boxes for titles, abstracts, and figures.”
Response by Qwen2.5-VL:
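As a hedged sketch of how such layout output might be consumed, assume each returned HTML tag carries a data-bbox attribute in the QwenVL HTML style (the attribute name, tags, and text below are illustrative assumptions, not a real response); Python’s standard library is enough to pull out the elements and their boxes.

from html.parser import HTMLParser

# Hypothetical QwenVL-HTML-style output; the tags, text, and data-bbox
# attribute are illustrative assumptions, not a real model response.
doc = (
    '<h1 data-bbox="72 88 540 130">A Technical Report</h1>'
    '<p data-bbox="72 160 540 420">Abstract: This report studies ...</p>'
    '<img data-bbox="72 450 540 700" alt="Figure 1">'
)

class LayoutCollector(HTMLParser):
    # Print every element that carries layout coordinates
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "data-bbox" in attrs:
            print(f'{tag}: bbox={attrs["data-bbox"]}')

LayoutCollector().feed(doc)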
Qwen2.5-VL demonstrates state-of-the-art results across diverse benchmarks, solidifying its position as a leader in vision-language tasks. The flagship Qwen2.5-VL-72B-Instruct excels in college-level problem-solving, mathematical reasoning, document understanding, video analysis, and agent-based applications. Notably, it outperforms competitors in document/diagram comprehension and operates as a visual agent without task-specific fine-tuning.
The model outperforms competitors like Gemini-2 Flash, GPT-4o, and Claude 3.5 Sonnet across benchmarks such as MMMU (70.2), DocVQA (96.4), and VideoMME (73.3/79.1).
For smaller models, Qwen2.5-VL-7B-Instruct surpasses GPT-4o-mini in multiple tasks, while the compact Qwen2.5-VL-3B—designed for edge AI—outperforms its predecessor, Qwen2-VL-7B, showcasing efficiency without compromising capability.
You can access Qwen2.5-VL in two ways: through Hugging Face Transformers or via the API. Let’s look at both.
To access the Qwen2.5-VL model using Hugging Face, follow these steps:
First, make sure you have the latest versions of Hugging Face Transformers and Accelerate by installing them from source:
pip install git+https://github.com/huggingface/transformers accelerate
Also, install qwen-vl-utils for handling various types of visual input:
pip install qwen-vl-utils[decord]==0.0.8
If you’re not on Linux, you can install it without the [decord] extra (see the example below); if you do need decord for video decoding, try installing it from source.
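For example, the fallback install without video-decoding support, keeping the same version pin, is simply:

pip install qwen-vl-utils==0.0.8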
Use the following code to load the Qwen2.5-VL model and processor from Hugging Face:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
# Load the processor for handling inputs (images, text, etc.)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
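Optionally, you can bound the number of visual tokens each image consumes by passing min_pixels and max_pixels to the processor, as described in the model card; adjust or drop these if your setup differs. A sketch:

# Optional: trade visual detail for memory by capping the per-image token budget
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)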
You can provide images and text in different formats (URLs, base64, or local paths). Here’s an example using an image URL:
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "https://path.to/your/image.jpg"},
{"type": "text", "text": "Describe this image."}
]
}
]
Prepare the input for the model, including images and text, and tokenize the text:
# Process the messages (images + text)
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(model.device)  # Move the inputs to the same device as the model
Generate the model’s output based on the inputs:
# Generate the output from the model
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Decode the output
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Here’s how you can access the Qwen2.5-VL-72B model through the DashScope API:
import dashscope
# Set your Dashscope API key
dashscope.api_key = "your_api_key"
# Make the API call with the desired model and messages
response = dashscope.MultiModalConversation.call(
model='qwen2.5-vl-72b-instruct',
messages=[{"role": "user", "content": [{"image": "image_url"}, {"text": "Query"}]}]
)
# You can access the response here
print(response)
Make sure to replace “your_api_key” with your actual API key and “image_url” with the URL of the image you want to use, along with the query text.
Qwen2.5-VL’s upgrades unlock diverse applications across industries, transforming how professionals interact with visual and textual data. Here are some of its real-life use cases:
1. Document Analysis
The model revolutionizes workflows by effortlessly parsing complex materials like multilingual research papers, handwritten notes, financial invoices, and technical diagrams.
2. Industrial Automation
With pinpoint object detection and JSON-formatted coordinates, Qwen2.5-VL boosts precision in factories and warehouses.
3. Media Production
The model’s video analysis skills save hours for content creators. It can scan a 2-hour documentary to tag key scenes, generate chapter summaries, or extract clips of specific events (e.g., “all shots of the Eiffel Tower”).
4. Smart Device Integration
Qwen2.5-VL powers “AI assistants” that understand screen content and automate tasks.
Qwen2.5-VL is a major step forward in AI technology that combines text, images, and video understanding. Building on its earlier versions, this model introduces smarter features like reading complex documents, including handwritten notes and charts. It also pinpoints objects in images with precise coordinates and analyzes hours-long videos to identify key moments.
Easy to access through platforms like Hugging Face or APIs, Qwen2.5-VL makes powerful AI tools available to everyone. By tackling real-world challenges, from reducing manual data entry to speeding up content creation, Qwen2.5-VL proves that advanced AI isn’t just for labs. It’s a practical tool reshaping everyday workflows across the globe.
Q. What is Qwen2.5-VL?
A. Qwen2.5-VL is an advanced multimodal AI model that can process and understand both images and text. It combines innovative technologies to provide accurate results for tasks like document parsing, object detection, and video analysis.
Q. How does Qwen2.5-VL improve on its predecessor, Qwen2-VL?
A. Qwen2.5-VL introduces architectural improvements like mRoPE for better spatial and temporal alignment, a more efficient vision encoder, and dynamic resolution training, allowing it to outperform models like GPT-4o and Gemini-2 Flash.
Q. Which industries can benefit from Qwen2.5-VL?
A. Industries such as finance, logistics, media, and education can benefit from Qwen2.5-VL’s capabilities in document processing, automation, and video understanding, helping solve complex challenges with improved efficiency.
Q. How can I access Qwen2.5-VL?
A. Qwen2.5-VL is accessible through platforms like Hugging Face, APIs, and edge-compatible versions that can run on devices with limited computing power.
Q. What makes Qwen2.5-VL unique?
A. Qwen2.5-VL is unique due to its state-of-the-art performance, ability to process long videos, precision in object detection, and versatility in real-world applications, all achieved through advanced technological innovations.
Q. Is Qwen2.5-VL good at document parsing?
A. Yes, Qwen2.5-VL excels in document parsing, making it an ideal solution for handling and analyzing large volumes of text and images from documents across different industries.
Q. Can smaller businesses use Qwen2.5-VL?
A. Yes, Qwen2.5-VL has edge-compatible versions that allow businesses with limited processing power to leverage its capabilities, making it accessible even for smaller companies or environments with less computational capacity.