With the release of DeepSeek V3 and R1, U.S. tech giants are scrambling to defend their competitive edge. Now, DeepSeek has introduced Janus Pro, a state-of-the-art multimodal AI model that further strengthens its position in both understanding and generative tasks. Janus Pro outperforms many leading models on multimodal reasoning, text-to-image generation, and instruction-following benchmarks.
Janus Pro builds upon its predecessor, Janus, by introducing optimized training strategies, expanding its dataset, and scaling its model architecture. These enhancements enable notable improvements in multimodal understanding and text-to-image instruction following, setting a new benchmark in the field. In this article, we dissect the research paper to explain what’s inside DeepSeek Janus Pro and how you can access DeepSeek Janus Pro 7B.
DeepSeek Janus Pro 7B is an AI model designed to handle tasks across multiple formats, such as text and images, in one system. What makes it stand out is its design: it separates the processing of visual information into different pathways while using a single transformer framework to bring everything together. This setup makes the model more flexible and efficient, whether it is analyzing content or generating new content. Compared to older multimodal AI models, Janus Pro 7B takes a big step forward in both performance and versatility.
Optimized Visual Processing: Janus Pro 7B uses separate pathways for handling visual data. This design boosts its ability to understand and process visual tasks more effectively than earlier models.
Unified Transformer Design: The model features a streamlined architecture that brings together different types of data (like text and visuals) seamlessly. This improves its ability to both understand and generate content across multiple formats.
Open and Accessible: Janus Pro 7B is open source and freely available on platforms like Hugging Face. This makes it easy for developers and researchers to dive in, experiment, and unlock its full potential without restrictions.
Multimodal Understanding and Visual Generation Results
Multimodal Understanding Performance
This graph compares average performance across four benchmarks that test a model’s ability to understand both text and visual data.
The x-axis represents the number of model parameters (billions), which indicates model size.
The y-axis shows average performance across these benchmarks.
Janus-Pro-7B is positioned at the top, showing that it outperforms many competing models, including LLaVA, VILA, and Emu3-Chat.
The red and green lines indicate different groups of models: the Janus-Pro family (unified models) and the LLaVA family (understanding only).
Instruction-Following for Image Generation
This graph evaluates how well models generate images based on text prompts.
Two benchmarks are used:
GenEval
DPG-Bench
The y-axis represents accuracy (%).
The Janus models (Janus and Janus-Pro-7B) achieve the highest accuracy, surpassing SDXL, DALL-E 3, and other image-generation models.
This suggests that Janus-Pro-7B is highly effective at generating images based on text prompts.
In a nutshell, Janus-Pro outperforms both unified multimodal models and specialized models, making it a top-performing AI for both understanding and generating visual content.
Key Takeaways
Janus-Pro-7B excels in multimodal understanding, outperforming competitors.
It also achieves state-of-the-art performance in text-to-image generation, making it a powerful model for creative AI tasks.
Its performance is strong across multiple benchmarks, proving it is a well-rounded AI system.
Key Advancements in Janus Pro
DeepSeek Janus Pro incorporates improvements in four primary areas: training strategies, data scaling, model architecture, and implementation efficiency.
1. Optimized Training Strategy
Janus-Pro refines its training pipeline to address computational inefficiencies observed in Janus:
Extended Stage I Training: The initial stage focuses on training adaptors and the image prediction head using ImageNet data. Janus-Pro lengthens this stage, ensuring a robust capability for modeling pixel dependencies, even with frozen language model parameters.
Streamlined Stage II Training: Unlike Janus, which allocated a large portion of training to ImageNet data for pixel dependency modeling, Janus-Pro skips this step in Stage II. Instead, it directly trains on dense text-to-image datasets, improving efficiency and performance in generating visually coherent images.
Dataset Ratio Adjustments: The supervised fine-tuning phase (Stage III) now uses a balanced multimodal dataset ratio (5:1:4 for multimodal, text, and text-to-image data, respectively). This adjustment maintains robust visual generation while enhancing multimodal understanding.
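To make the 5:1:4 mix concrete, here is a minimal, illustrative sketch of how such a sampling ratio could be implemented with weighted random sampling. The dataset names and contents below are hypothetical placeholders, not DeepSeek's actual training data or code:
import random

# Hypothetical placeholder datasets -- stand-ins for the real training corpora.
datasets = {
    "multimodal_understanding": [f"mm_sample_{i}" for i in range(1000)],
    "pure_text": [f"text_sample_{i}" for i in range(1000)],
    "text_to_image": [f"t2i_sample_{i}" for i in range(1000)],
}

# Stage III mixing ratio reported for Janus-Pro: multimodal : text : text-to-image = 5 : 1 : 4.
weights = {"multimodal_understanding": 5, "pure_text": 1, "text_to_image": 4}

def sample_batch(batch_size: int):
    """Draw a training batch whose composition follows the 5:1:4 ratio in expectation."""
    sources = random.choices(
        population=list(weights.keys()),
        weights=list(weights.values()),
        k=batch_size,
    )
    return [(src, random.choice(datasets[src])) for src in sources]

print(sample_batch(10))
Over many batches, roughly half the samples come from multimodal understanding data and a tenth from pure text, which is the balance the paper credits for preserving visual generation quality while improving understanding.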
2. Data Scaling
To boost the multimodal understanding and visual generation capabilities, Janus-Pro significantly expands its dataset:
Multimodal Understanding Data: The dataset has grown by 90 million samples, including contributions from YFCC, Docmatix, and other sources. These datasets enrich the model’s ability to handle diverse tasks, from document analysis to conversational AI.
Visual Generation Data: Recognizing the limitations of noisy, real-world data, Janus-Pro integrates 72 million synthetic aesthetic samples, achieving a balanced 1:1 real-to-synthetic data ratio. These synthetic samples, curated for quality, accelerate convergence and enhance image generation stability and aesthetics.
3. Model Scaling
Janus-Pro scales the architecture of the original Janus:
Larger Language Model (LLM): The model size increases from 1.5 billion parameters to 7 billion, with improved hyperparameters. This scaling enhances both multimodal understanding and visual generation by speeding up convergence and improving generalization.
Decoupled Visual Encoding: The architecture employs independent encoders for multimodal understanding and generation. Image inputs are processed by SigLIP for high-dimensional semantic feature extraction, while visual generation utilizes a VQ tokenizer to convert images into discrete IDs.
Detailed Methodology of DeepSeek Janus Pro 7B
1. Architectural Overview
Janus-Pro adheres to an autoregressive framework with a decoupled visual encoding approach:
Multimodal Understanding: The SigLIP encoder extracts a 2D grid of semantic features from the image, which is flattened into a 1D sequence. An adaptor then maps these features into the input space of the LLM.
Visual Generation: The VQ tokenizer converts images into discrete IDs. These IDs are flattened and mapped into the LLM’s input space using a generation adaptor.
Unified Processing: The multimodal feature sequences are concatenated and processed by the LLM, with separate prediction heads for text and image outputs.
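To visualize how this decoupled routing fits together, here is a simplified sketch of the two encoding paths. The module names and dimensions are illustrative placeholders, not the actual Janus-Pro implementation:
import torch
import torch.nn as nn

# Illustrative dimensions only -- not the real Janus-Pro sizes.
VISION_DIM, LLM_DIM, CODEBOOK_SIZE = 1024, 2048, 16384

understanding_adaptor = nn.Linear(VISION_DIM, LLM_DIM)     # maps semantic vision features into LLM space
generation_adaptor = nn.Embedding(CODEBOOK_SIZE, LLM_DIM)  # maps discrete VQ codes into LLM space

def encode_for_understanding(image_features: torch.Tensor) -> torch.Tensor:
    # image_features: (batch, H, W, VISION_DIM) grid from a SigLIP-style semantic encoder
    b, h, w, d = image_features.shape
    flat = image_features.reshape(b, h * w, d)              # flatten the 2D grid into a 1D sequence
    return understanding_adaptor(flat)                      # project into the LLM input space

def encode_for_generation(vq_ids: torch.Tensor) -> torch.Tensor:
    # vq_ids: (batch, num_tokens) discrete IDs produced by a VQ tokenizer
    return generation_adaptor(vq_ids)                       # embed the IDs into the LLM input space

# Both sequences can then be concatenated with text embeddings and fed to the shared LLM,
# which uses separate prediction heads for text tokens and image tokens.
understanding_seq = encode_for_understanding(torch.randn(1, 24, 24, VISION_DIM))
generation_seq = encode_for_generation(torch.randint(0, CODEBOOK_SIZE, (1, 576)))
print(understanding_seq.shape, generation_seq.shape)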
1. Understanding (Processing Images to Generate Text)
This module enables the model to analyze and describe images based on an input query.
How It Works:
Input: Image
The model takes an image as input.
Und. Encoder (Understanding Encoder)
Extracts important visual features from the image (such as objects, colors, and spatial relationships).
Converts the raw image into a compressed representation that the transformer can understand.
Text Tokenizer
If a language instruction is provided (e.g., “What is in this image?”), it is tokenized into a numerical format.
Auto-Regressive Transformer
Processes both image features and text tokens to generate a text response.
Text De-Tokenizer
Converts the model’s numerical output into human-readable text.
Example: Input: An image of a cat sitting on a table + “Describe the image.” Output: “A small white cat is sitting on a wooden table.”
2. Image Generation (Processing Text to Generate Images)
This module enables the model to create new images from textual descriptions.
How It Works:
Input: Language Instruction
A user provides a text prompt describing the desired image (e.g., “A futuristic city at night.”).
Text Tokenizer
The text input is tokenized into numerical format.
Auto-Regressive Transformer
Predicts the image representation token by token.
Gen. Encoder (Generation Encoder)
Converts the predicted image representation into a structured format.
Image Decoder
Generates the final image based on the encoded representation.
Example: Input: “A dragon flying over a castle at sunset.” Output: AI-generated image of a dragon soaring above a medieval castle at sunset.
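For a mental model of how the transformer produces an image token by token, here is a highly simplified sketch of the autoregressive loop. The helper name sample_next_image_token and the sizes are hypothetical stand-ins; the official generation example in the deepseek-ai/Janus repository is the authoritative reference:
import torch

IMAGE_TOKENS = 576       # e.g. a 24x24 grid of VQ codes (illustrative)
CODEBOOK_SIZE = 16384    # illustrative codebook size

def sample_next_image_token(token_sequence: torch.Tensor) -> torch.Tensor:
    # Placeholder for the transformer's next-token prediction over the image codebook.
    # In the real model this would be a forward pass through the LLM plus the image head.
    return torch.randint(0, CODEBOOK_SIZE, (token_sequence.shape[0], 1))

def generate_image_tokens(prompt_tokens: torch.Tensor) -> torch.Tensor:
    """Autoregressively extend the tokenized prompt with image tokens, one at a time."""
    sequence = prompt_tokens
    image_tokens = []
    for _ in range(IMAGE_TOKENS):
        next_token = sample_next_image_token(sequence)
        image_tokens.append(next_token)
        sequence = torch.cat([sequence, next_token], dim=1)
    return torch.cat(image_tokens, dim=1)   # (batch, IMAGE_TOKENS) discrete codes

# The discrete codes would then be passed to the image decoder (a VQ decoder) to reconstruct
# pixels; that step is omitted here because it depends on model internals.
prompt = torch.randint(0, CODEBOOK_SIZE, (1, 16))   # stand-in for a tokenized text prompt
codes = generate_image_tokens(prompt)
print(codes.shape)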
3. Key Components in the Model
Und. Encoder: Extracts visual features from input images.
Text Tokenizer: Converts text input into tokens for processing.
Auto-Regressive Transformer: Central module that handles both text and image generation sequentially.
Gen. Encoder: Converts generated image tokens into structured representations.
Image Decoder: Produces an image from encoded representations.
Text De-Tokenizer: Converts generated text tokens into human-readable responses.
4. Why This Architecture?
Unified Transformer Model: Uses the same transformer to process both images and text.
Sequential Generation: Outputs are generated step-by-step for both images and text.
Multi-Modal Learning: Can understand and generate images and text in a single system.
The DeepSeek Janus-Pro model is a powerful vision-language AI system that enables both image comprehension and text-to-image generation. By leveraging auto-regressive learning, it efficiently produces text and images in a structured and scalable manner. 🚀
2. Training Strategy Enhancements
Janus-Pro modifies the three-stage training pipeline:
Stage I: Focuses on ImageNet-based pretraining with extended training time.
Stage II: Discards ImageNet data in favor of dense text-to-image datasets, improving computational efficiency.
Stage III: Adjusts dataset ratios to balance multimodal, text, and text-to-image data.
3. Implementation Efficiency
Janus-Pro utilizes the HAI-LLM framework, leveraging NVIDIA A100 GPUs for distributed training. The entire training process is streamlined, taking 7 days for the 1.5B model and 14 days for the 7B model across multiple nodes.
Experimental Results
Janus-Pro demonstrates significant advancements over previous models:
Convergence Speed: Scaling to 7B parameters significantly reduces convergence time for multimodal understanding and visual generation tasks.
Improved Visual Generation: Synthetic data enhances text-to-image stability and aesthetics, though fine details (e.g., small facial features) remain challenging due to resolution limitations.
Enhanced Multimodal Understanding: Expanded datasets and a refined training strategy improve the model’s ability to comprehend and generate meaningful multimodal outputs.
First, save the required Python libraries and dependencies in a requirements.txt file in Google Colab, then install them (in a Colab cell, prefix the shell command with !):
!pip install -r /content/requirements.txt
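Note that the janus package imported below is distributed through DeepSeek's official GitHub repository rather than PyPI; if your requirements.txt does not already cover it, installing it directly from GitHub is one common approach (an assumption here, so verify against the repository's own instructions):
pip install git+https://github.com/deepseek-ai/Janus.git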
Then import the required libraries:
import torch
from transformers import AutoConfig, AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
from PIL import Image
# specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
# placeholder inputs -- replace these with your own image path and question
image = "your_image.png"   # hypothetical path, used here only for illustration
question = "Describe the image."

conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]
# load images and prepare the inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
Output:
The image contains a logo with a stylized design that includes a circular pattern resembling a target or a camera aperture. Within this design, there is a cartoon character with sunglasses and a hand gesture, which appears to be a playful or humorous representation.
The text next to the logo reads "License to Call." This suggests that the image is likely related to a service or product that involves calling or communication, possibly with a focus on licensing or authorization.
The overall design and text imply that the service or product is related to communication, possibly involving a license or authorization process.
Outputs of DeepSeek Janus Pro 7B
Image Description
DeepSeek Janus-Pro produces an impressive and human-like description with excellent structure, vivid imagery, and strong coherence. Minor refinements could make it even more concise and precise.
Text Recognition
The text recognition output is accurate, clear, and well-structured, effectively capturing the main heading. However, it misses smaller text details and could mention the stylized typography for a richer description. Overall, it’s a strong response but could be improved with more completeness and visual insights.
Text-To-Image Generation
A strong and diverse text-to-image generation output with accurate visuals and descriptive clarity. A few refinements, such as fixing text cut-offs and adding finer details, could elevate the quality further.
Check out our detailed articles on how DeepSeek works and how it compares with similar models.
Limitations of DeepSeek Janus Pro
Despite its successes, Janus-Pro has certain limitations:
Resolution Constraints: The 384 × 384 resolution restricts performance in fine-grained tasks like OCR or detailed image generation.
Reconstruction Loss: The use of the VQ tokenizer introduces reconstruction losses, leading to under-detailed outputs in smaller image regions.
Text-to-Image Challenges: While stability and aesthetics have improved, achieving ultra-high fidelity in generated images remains an ongoing challenge.
Future work could focus on:
Increasing image resolution to address fine detail limitations.
Exploring alternative tokenization methods to reduce reconstruction losses.
Enhancing the training pipeline with adaptive methods for diverse tasks.
Conclusion
Janus-Pro marks a transformative step in multimodal AI. By optimizing training strategies, scaling data, and expanding model size, it achieves state-of-the-art results in multimodal understanding and text-to-image generation. Despite some limitations, Janus-Pro lays a strong foundation for future research in scalable, efficient multimodal AI systems. Its advancements highlight the growing potential of AI to bridge the gap between vision and language, inspiring further innovation in the field.
Ready to dive into the world of DeepSeek? Enroll in our course on accessing DeepSeek Janus Pro 7B today and unlock the power of multimodal AI!
Hi, I am Pankaj Singh Negi - Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.