How to Access DeepSeek Janus Pro 7B?

Pankaj Singh Last Updated : 29 Jan, 2025

9 min read

Following the release of DeepSeek V3 and R1, which put pressure on U.S. tech giants, DeepSeek has introduced Janus Pro, a multimodal AI model that handles both understanding and generative tasks. According to its paper, Janus Pro outperforms many leading models on multimodal reasoning, text-to-image generation, and instruction-following benchmarks.

Janus Pro builds upon its predecessor, Janus, by introducing optimized training strategies, expanding its dataset, and scaling its model architecture. These enhancements enable Janus Pro to achieve notable improvements in multimodal understanding and text-to-image instruction-following capabilities. In this article, we will dissect the research paper to help you understand what’s inside DeepSeek Janus Pro and how you can access DeepSeek Janus Pro 7B.

What is DeepSeek Janus Pro 7B?
Multimodal Understanding and Visual Generation Results
Key Advancements in Janus Pro
Detailed Methodology of DeepSeek Janus Pro 7B
How to Access DeepSeek Janus Pro 7B?
Outputs of DeepSeek Janus Pro 7B
Limitations and Future Directions
Conclusion

What is DeepSeek Janus Pro 7B?

DeepSeek Janus Pro 7B is an AI model designed to handle both multimodal understanding (images and text in, text out) and text-to-image generation in one system. What makes it stand out is its design: it decouples visual encoding into separate pathways for understanding and for generation, while using a single transformer framework to bring everything together. This setup makes the model more flexible, whether it’s analyzing content or generating new images. Compared to earlier unified multimodal models, Janus Pro 7B takes a big step forward in both performance and versatility.

Optimized Visual Processing: Janus Pro 7B uses separate visual encoding pathways for understanding images and for generating them. This design lets each pathway use a representation suited to its task, which improves results on visual tasks compared with earlier models.
Unified Transformer Design: The model features a streamlined architecture that brings together different types of data (like text and visuals) seamlessly. This improves its ability to both understand and generate content across multiple formats.
Open and Accessible: Janus Pro 7B is openly released and freely available on platforms like Hugging Face, subject to its license terms. This makes it easy for developers and researchers to dive in and experiment with it.

Multimodal Understanding and Visual Generation Results

Figure: DeepSeek Janus Pro 7B benchmark results. Source: DeepSeek Janus Pro paper

Multimodal Understanding Performance

This graph compares average performance across four benchmarks that test a model’s ability to understand both text and visual data.
The x-axis represents the number of model parameters (billions), which indicates model size.
The y-axis shows average performance across these benchmarks.
Janus-Pro-7B is positioned at the top, showing that it outperforms many competing models, including LLaVA, VILA, and Emu3-Chat.
The red and green lines indicate different groups of models: the Janus-Pro family (unified models) and the LLaVA family (understanding only).

Instruction-Following for Image Generation

This graph evaluates how well models generate images based on text prompts.
Two benchmarks are used:
- GenEval
- DPG-Bench
The y-axis represents accuracy (%).
Janus-Pro-7B achieves the highest accuracy among the models shown, surpassing its predecessor Janus as well as SDXL, DALL-E 3, and other image generation models.
This suggests that Janus-Pro-7B is highly effective at generating images based on text prompts.

In a nutshell, on these benchmarks Janus-Pro outperforms both earlier unified multimodal models and several task-specific models, making it a strong performer for both understanding and generating visual content.

Key Takeaways

Janus-Pro-7B excels in multimodal understanding, outperforming competitors.
It also achieves state-of-the-art performance in text-to-image generation, making it a powerful model for creative AI tasks.
Its performance is strong across multiple benchmarks, proving it is a well-rounded AI system.

Key Advancements in Janus Pro

DeepSeek Janus Pro incorporates improvements in four primary areas: training strategies, data scaling, model architecture, and implementation efficiency.

1. Optimized Training Strategy

Janus-Pro refines its training pipeline to address computational inefficiencies observed in Janus:

Extended Stage I Training: The initial stage focuses on training adaptors and the image prediction head using ImageNet data. Janus-Pro lengthens this stage, ensuring a robust capability for modeling pixel dependencies, even with frozen language model parameters.
Streamlined Stage II Training: Unlike Janus, which allocated a large portion of training to ImageNet data for pixel dependency modeling, Janus-Pro skips this step in Stage II. Instead, it directly trains on dense text-to-image datasets, improving efficiency and performance in generating visually coherent images.
Dataset Ratio Adjustments: The supervised fine-tuning phase (Stage III) now uses a balanced multimodal dataset ratio (5:1:4 for multimodal, text, and text-to-image data, respectively). This adjustment maintains robust visual generation while enhancing multimodal understanding.
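To make the 5:1:4 ratio concrete, here is a minimal sketch of how such a ratio translates into sampling probabilities for a training batch. The function names and the simple weighted sampler are illustrative assumptions, not the paper's actual data loader.

```python
import random
from collections import Counter

# Stage III ratio from the text: multimodal : text : text-to-image = 5 : 1 : 4.
RATIO = {"multimodal": 5, "text": 1, "text_to_image": 4}


def mixing_probabilities(ratio):
    """Convert integer ratio parts into sampling probabilities that sum to 1."""
    total = sum(ratio.values())
    return {name: part / total for name, part in ratio.items()}


def sample_sources(ratio, n, seed=0):
    """Draw n data-source labels in proportion to the ratio (illustrative sampler)."""
    rng = random.Random(seed)
    names = list(ratio)
    weights = [ratio[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


probs = mixing_probabilities(RATIO)
counts = Counter(sample_sources(RATIO, n=10_000))
print(probs)   # share of each data type
print(counts)  # empirical counts over 10,000 draws
```

Under this ratio, half of the sampled examples are multimodal understanding data, which matches the stated goal of strengthening understanding while keeping visual generation robust.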

2. Data Scaling

To boost the multimodal understanding and visual generation capabilities, Janus-Pro significantly expands its dataset:

Multimodal Understanding Data: The dataset has grown by 90 million samples, including contributions from YFCC, Docmatix, and other sources. These datasets enrich the model’s ability to handle diverse tasks, from document analysis to conversational AI.
Visual Generation Data: Recognizing the limitations of noisy, real-world data, Janus-Pro integrates 72 million synthetic aesthetic samples, achieving a balanced 1:1 real-to-synthetic data ratio. These synthetic samples, curated for quality, accelerate convergence and enhance image generation stability and aesthetics.

3. Model Scaling

Janus-Pro scales the architecture of the original Janus:

Larger Language Model (LLM): The model size increases from 1.5 billion parameters to 7 billion, with improved hyperparameters. This scaling enhances both multimodal understanding and visual generation by speeding up convergence and improving generalization.
Decoupled Visual Encoding: The architecture employs independent encoders for multimodal understanding and generation. Image inputs are processed by SigLIP for high-dimensional semantic feature extraction, while visual generation utilizes a VQ tokenizer to convert images into discrete IDs.
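The idea of a VQ tokenizer turning image content into discrete IDs can be sketched as a nearest-neighbour lookup in a codebook. The tiny 2-dimensional codebook and patch vectors below are made-up toy values; the real tokenizer works on learned feature maps with a far larger codebook.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors, codebook):
    """Map each vector to the index (discrete ID) of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(v, codebook[i]))
        for v in vectors
    ]


def dequantize(ids, codebook):
    """Look up the codebook entry for each ID (a lossy reconstruction)."""
    return [codebook[i] for i in ids]


# Toy codebook with 4 entries (hypothetical values, for illustration only).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patches = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.7, 0.9)]

ids = quantize(patches, CODEBOOK)
recon = dequantize(ids, CODEBOOK)
print(ids)
print(recon)
```

The reconstructed vectors differ slightly from the inputs. That gap is the reconstruction loss that any quantization of this kind introduces.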

Detailed Methodology of DeepSeek Janus Pro 7B

1. Architectural Overview

Janus-Pro adheres to an autoregressive framework with a decoupled visual encoding approach:

Multimodal Understanding: The SigLIP encoder extracts high-dimensional semantic features from the image. These features are flattened from a 2D grid into a 1D sequence, and an understanding adaptor then maps them into the input space of the LLM.
Visual Generation: The VQ tokenizer converts images into discrete IDs. These IDs are flattened and mapped into the LLM’s input space using a generation adaptor.
Unified Processing: The multimodal feature sequences are concatenated and processed by the LLM, with separate prediction heads for text and image outputs.
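The flatten, adapt, and concatenate steps above can be sketched in plain Python. This is a simplified illustration: a single linear map stands in for the adaptor, and the grid, weights, and text embeddings are hypothetical toy values.

```python
def flatten_grid(grid):
    """Flatten an H x W grid of feature vectors into a row-major sequence of length H*W."""
    return [vec for row in grid for vec in row]


def linear_adaptor(sequence, weight):
    """Project each feature vector with a weight matrix (out_dim rows, in_dim columns)."""
    return [
        [sum(w * x for w, x in zip(w_row, vec)) for w_row in weight]
        for vec in sequence
    ]


# A 2 x 2 grid of 3-dimensional features, standing in for the vision encoder output.
grid = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
]
# Hypothetical adaptor weights mapping 3-dim features to a 2-dim "LLM input space".
weight = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]

image_seq = linear_adaptor(flatten_grid(grid), weight)
text_seq = [[0.1, 0.2], [0.3, 0.4]]  # stand-in text token embeddings

# Concatenate image and text features into one sequence for the LLM.
llm_input = image_seq + text_seq
print(len(llm_input), llm_input)
```

Once both modalities live in the same embedding space, the LLM can treat the result as one ordinary token sequence.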

1. Understanding (Processing Images to Generate Text)

This module enables the model to analyze and describe images based on an input query.

How It Works:

Input: Image
- The model takes an image as input.
Und. Encoder (Understanding Encoder)
- Extracts important visual features from the image (such as objects, colors, and spatial relationships).
- Converts the raw image into a compressed representation that the transformer can understand.
Text Tokenizer
- If a language instruction is provided (e.g., “What is in this image?”), it is tokenized into a numerical format.
Auto-Regressive Transformer
- Processes both image features and text tokens to generate a text response.
Text De-Tokenizer
- Converts the model’s numerical output into human-readable text.

Example:
Input: An image of a cat sitting on a table + “Describe the image.”
Output: “A small white cat is sitting on a wooden table.”

2. Image Generation (Processing Text to Generate Images)

This module enables the model to create new images from textual descriptions.

How It Works:

Input: Language Instruction
- A user provides a text prompt describing the desired image (e.g., “A futuristic city at night.”).
Text Tokenizer
- The text input is tokenized into numerical format.
Auto-Regressive Transformer
- Predicts the image representation token by token.
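Gen. Encoder (Generation Encoder)
- The VQ tokenizer that converts images into discrete IDs. These IDs are mapped into the transformer’s input space by the generation adaptor, so that image tokens can be consumed by the same transformer.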
Image Decoder
- Generates the final image based on the encoded representation.

Example:
Input: “A dragon flying over a castle at sunset.”
Output: AI-generated image of a dragon soaring above a medieval castle at sunset.
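The "token by token" prediction described above can be sketched as a simple loop. The `toy_next_token` function is a hypothetical stand-in for the transformer and its image prediction head (it just returns random IDs), and the sizes are toy values, so this only shows the control flow, not real generation.

```python
import random


def toy_next_token(prefix, vocab_size, rng):
    """Hypothetical stand-in for the model: return the next image-token ID."""
    return rng.randrange(vocab_size)


def generate_image_tokens(prompt_tokens, num_image_tokens, vocab_size, seed=0):
    """Autoregressive loop: each predicted token is appended and fed back as context."""
    rng = random.Random(seed)
    sequence = list(prompt_tokens)
    image_tokens = []
    for _ in range(num_image_tokens):
        token = toy_next_token(sequence, vocab_size, rng)
        image_tokens.append(token)
        sequence.append(token)
    return image_tokens


def to_grid(tokens, side):
    """Reshape a flat token list into a side x side grid for the image decoder."""
    return [tokens[r * side:(r + 1) * side] for r in range(side)]


prompt = [101, 7, 42]  # made-up text token IDs
tokens = generate_image_tokens(prompt, num_image_tokens=16, vocab_size=8)
token_grid = to_grid(tokens, side=4)
print(token_grid)
```

In the real model, the finished grid of discrete IDs is what the image decoder turns back into pixels.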

3. Key Components in the Model

Component	Function
Und. Encoder	Extracts visual features from input images.
Text Tokenizer	Converts text input into tokens for processing.
Auto-Regressive Transformer	Central module that handles both text and image generation sequentially.
Gen. Encoder	Converts images into discrete token IDs (VQ tokenizer) for the generation pathway.
Image Decoder	Produces an image from encoded representations.
Text De-Tokenizer	Converts generated text tokens into human-readable responses.

4. Why This Architecture?

Unified Transformer Model: Uses the same transformer to process both images and text.
Sequential Generation: Outputs are generated step-by-step for both images and text.
Multi-Modal Learning: Can understand and generate images and text in a single system.

The DeepSeek Janus-Pro model is a powerful vision-language AI system that enables both image comprehension and text-to-image generation. By leveraging auto-regressive learning, it efficiently produces text and images in a structured and scalable manner. 🚀

2. Training Strategy Enhancements

Janus-Pro modifies the three-stage training pipeline:

Stage I: Focuses on ImageNet-based pretraining with extended training time.
Stage II: Discards ImageNet data in favor of dense text-to-image datasets, improving computational efficiency.
Stage III: Adjusts dataset ratios to balance multimodal, text, and text-to-image data.

3. Implementation Efficiency

Janus-Pro utilizes the HAI-LLM framework, leveraging NVIDIA A100 GPUs for distributed training. The entire training process is streamlined, taking 7 days for the 1.5B model and 14 days for the 7B model across multiple nodes.

Experimental Results

Janus-Pro demonstrates significant advancements over previous models:

Convergence Speed: Scaling to 7B parameters significantly reduces convergence time for multimodal understanding and visual generation tasks.
Improved Visual Generation: Synthetic data enhances text-to-image stability and aesthetics, though fine details (e.g., small facial features) remain challenging due to resolution limitations.
Enhanced Multimodal Understanding: Expanded datasets and a refined training strategy improve the model’s ability to comprehend and generate meaningful multimodal outputs.

Models in the Janus series:

Model	Sequence Length	Download
Janus-1.3B	4096	🤗 Hugging Face
JanusFlow-1.3B	4096	🤗 Hugging Face
Janus-Pro-1B	4096	🤗 Hugging Face
Janus-Pro-7B	4096	🤗 Hugging Face

How to Access DeepSeek Janus Pro 7B?

First, save the required Python libraries and dependencies in a requirements.txt file in Google Colab, then run:

pip install -r /content/requirements.txt

Once the required libraries are installed, use the code below:

import torch
from transformers import AutoConfig, AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
from PIL import Image

# specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

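# placeholders: replace with the path to your own image and your own question
image = "path/to/your_image.png"
question = "Describe this image."

conversation = [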
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

Refer to this for full code with Gradio: deepseek-ai/Janus-Pro-7B

Image

Output

The image contains a logo with a stylized design that includes a circular
 pattern resembling a target or a camera aperture. Within this design, there
 is a cartoon character with sunglasses and a hand gesture, which appears to
 be a playful or humorous representation.

The text next to the logo reads "License to Call." This suggests that the
 image is likely related to a service or product that involves calling or
 communication, possibly with a focus on licensing or authorization.

The overall design and text imply that the service or product is related to
 communication, possibly involving a license or authorization process.

Outputs of DeepSeek Janus Pro 7B

Image Description

DeepSeek Janus-Pro produces an impressive and human-like description with excellent structure, vivid imagery, and strong coherence. Minor refinements could make it even more concise and precise.

Text Recognition

The text recognition output is accurate, clear, and well-structured, effectively capturing the main heading. However, it misses smaller text details and could mention the stylized typography for a richer description. Overall, it’s a strong response but could be improved with more completeness and visual insights.

Text-To-Image Generation

A strong and diverse text-to-image generation output with accurate visuals and descriptive clarity. A few refinements, such as fixing text cut-offs and adding finer details, could elevate the quality further.


Limitations and Future Directions

Despite its successes, Janus-Pro has certain limitations:

Resolution Constraints: The 384 × 384 resolution restricts performance in fine-grained tasks like OCR or detailed image generation.
Reconstruction Loss: The use of the VQ tokenizer introduces reconstruction losses, leading to under-detailed outputs in smaller image regions.
Text-to-Image Challenges: While stability and aesthetics have improved, achieving ultra-high fidelity in generated images remains an ongoing challenge.
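A quick calculation shows why resolution is tied so closely to cost. The sketch below assumes the image tokenizer downsamples by a factor of 16 per side, a value taken from the Janus papers rather than from this article, so treat it as an assumption.

```python
def image_token_count(resolution, downsample=16):
    """Number of discrete tokens for a square image, assuming a fixed downsampling factor."""
    if resolution % downsample != 0:
        raise ValueError("resolution must be a multiple of the downsampling factor")
    side = resolution // downsample
    return side * side


for res in (384, 768, 1024):
    print(res, image_token_count(res))
```

At 384 × 384 this gives 576 tokens per image. Doubling the resolution quadruples the sequence length the autoregressive model must predict, which is one reason higher resolution is not a free upgrade.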

Future work could focus on:

Increasing image resolution to address fine detail limitations.
Exploring alternative tokenization methods to reduce reconstruction losses.
Enhancing the training pipeline with adaptive methods for diverse tasks.

Conclusion

Janus-Pro marks a transformative step in multimodal AI. By optimizing training strategies, scaling data, and expanding model size, it achieves state-of-the-art results in multimodal understanding and text-to-image generation. Despite some limitations, Janus-Pro lays a strong foundation for future research in scalable, efficient multimodal AI systems. Its advancements highlight the growing potential of AI to bridge the gap between vision and language, inspiring further innovation in the field.

