How to Access DeepSeek Janus Pro 7B?

Pankaj Singh Last Updated : 29 Jan, 2025

With the release of DeepSeek V3 and R1, U.S. tech giants are under pressure to regain their competitive edge. Now DeepSeek has introduced Janus Pro, a multimodal model that strengthens its position in both understanding and generation tasks. According to the paper, Janus Pro outperforms many leading models on multimodal reasoning, text-to-image generation, and instruction-following benchmarks.

Janus Pro builds upon its predecessor, Janus, by introducing optimized training strategies, expanding its dataset, and scaling its model architecture. These enhancements enable Janus Pro to achieve notable improvements in multimodal understanding and text-to-image instruction-following capabilities. In this article, we will dissect the research paper to help you understand what’s inside DeepSeek Janus Pro and how you can access DeepSeek Janus Pro 7B.

What is DeepSeek Janus Pro 7B?

DeepSeek Janus Pro 7B is an AI model that handles both text and images in one system: it can interpret images and generate them from text. What makes it stand out is its design: it uses separate visual encoding pathways for understanding and for generation, while a single transformer processes everything. This decoupling makes the model more flexible, whether it’s analyzing content or generating images. Compared to older multimodal AI models, Janus Pro 7B takes a big step forward in both performance and versatility.

  • Optimized Visual Processing: Janus Pro 7B uses separate pathways for visual data, with one encoder for understanding images and another for generating them. This design boosts its ability to handle visual tasks compared with earlier models.
  • Unified Transformer Design: The model features a streamlined architecture in which a single transformer processes text and visual data together. This improves its ability to both understand and generate content across the two formats.
  • Open and Accessible: Janus Pro 7B is openly released and freely available on Hugging Face, subject to its license terms. This makes it easy for developers and researchers to dive in and experiment with it.

Multimodal Understanding and Visual Generation Results

Figure: DeepSeek Janus Pro 7B benchmark results
Source: DeepSeek Janus Pro Paper

Multimodal Understanding Performance

  • This graph compares average performance across four benchmarks that test a model’s ability to understand both text and visual data.
  • The x-axis represents the number of model parameters (billions), which indicates model size.
  • The y-axis shows average performance across these benchmarks.
  • Janus-Pro-7B is positioned at the top, showing that it outperforms many competing models, including LLaVA, VILA, and Emu3-Chat.
  • The red and green lines indicate different groups of models: the Janus-Pro family (unified models) and the LLaVA family (understanding only).

Instruction-Following for Image Generation

  • This graph evaluates how well models generate images based on text prompts.
  • Two benchmarks are used:
    • GenEval
    • DPG-Bench
  • The y-axis represents accuracy (%).
  • Janus-Pro-7B achieves the highest accuracy among the models shown, surpassing its predecessor Janus as well as SDXL, DALL-E 3, and other image generation models.
  • This suggests that Janus-Pro-7B is highly effective at generating images based on text prompts.

In a nutshell, on these benchmarks Janus-Pro outperforms both earlier unified multimodal models and several task-specific models, making it a strong performer in both understanding and generating visual content.

Key Takeaways

  1. Janus-Pro-7B excels in multimodal understanding, outperforming competitors.
  2. It also achieves state-of-the-art performance in text-to-image generation, making it a powerful model for creative AI tasks.
  3. Its performance is strong across multiple benchmarks, suggesting it is a well-rounded AI system.

Key Advancements in Janus Pro

DeepSeek Janus Pro incorporates improvements in four primary areas: training strategies, data scaling, model architecture, and implementation efficiency.

1. Optimized Training Strategy

Janus-Pro refines its training pipeline to address computational inefficiencies observed in Janus:

  • Extended Stage I Training: The initial stage focuses on training adaptors and the image prediction head using ImageNet data. Janus-Pro lengthens this stage, ensuring a robust capability for modeling pixel dependencies, even with frozen language model parameters.
  • Streamlined Stage II Training: Unlike Janus, which allocated a large portion of training to ImageNet data for pixel dependency modeling, Janus-Pro skips this step in Stage II. Instead, it directly trains on dense text-to-image datasets, improving efficiency and performance in generating visually coherent images.
  • Dataset Ratio Adjustments: The supervised fine-tuning phase (Stage III) now uses a balanced multimodal dataset ratio (5:1:4 for multimodal, text, and text-to-image data, respectively). This adjustment maintains robust visual generation while enhancing multimodal understanding.
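
To make the Stage III ratio concrete, the sketch below converts a 5:1:4 mix into per-source sampling fractions and per-batch counts. The function name and the batch size of 100 are illustrative assumptions, not values from the paper.

```python
from fractions import Fraction


def mixture_fractions(ratio):
    """Turn integer ratio parts into exact sampling fractions per data source."""
    total = sum(ratio.values())
    return {name: Fraction(part, total) for name, part in ratio.items()}


# Stage III ratio quoted above: multimodal : text : text-to-image = 5 : 1 : 4
stage3_ratio = {"multimodal": 5, "text": 1, "text_to_image": 4}
fractions = mixture_fractions(stage3_ratio)

# Hypothetical batch size, chosen only so the counts come out as whole numbers.
batch_size = 100
per_batch = {name: int(frac * batch_size) for name, frac in fractions.items()}

for name in stage3_ratio:
    print(f"{name}: {float(fractions[name]):.0%} of samples, {per_batch[name]} per batch")
```

With this ratio, half of each batch is multimodal understanding data, a tenth is pure text, and the remaining four tenths are text-to-image data.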

2. Data Scaling

To boost the multimodal understanding and visual generation capabilities, Janus-Pro significantly expands its dataset:

  • Multimodal Understanding Data: The dataset has grown by 90 million samples, including contributions from YFCC, Docmatix, and other sources. These datasets enrich the model’s ability to handle diverse tasks, from document analysis to conversational AI.
  • Visual Generation Data: Recognizing the limitations of noisy, real-world data, Janus-Pro integrates 72 million synthetic aesthetic samples, achieving a balanced 1:1 real-to-synthetic data ratio. These synthetic samples, curated for quality, accelerate convergence and enhance image generation stability and aesthetics.

3. Model Scaling

Janus-Pro scales the architecture of the original Janus:

  • Larger Language Model (LLM): The model size increases from 1.5 billion parameters to 7 billion, with improved hyperparameters. This scaling enhances both multimodal understanding and visual generation by speeding up convergence and improving generalization.
  • Decoupled Visual Encoding: The architecture employs independent encoders for multimodal understanding and generation. Image inputs are processed by SigLIP for high-dimensional semantic feature extraction, while visual generation utilizes a VQ tokenizer to convert images into discrete IDs.
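
The idea of converting an image into discrete IDs can be illustrated with a toy vector quantizer: each feature vector is replaced by the index of its nearest codebook entry. The codebook, the 2-D feature vectors, and the function names below are made up for illustration; this is a sketch of the general technique, not the tokenizer used by Janus-Pro.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def quantize(features, codebook):
    """Map each continuous feature vector to a discrete ID."""
    return [nearest_code(v, codebook) for v in features]


def dequantize(ids, codebook):
    """Look the IDs back up; whatever detail is lost here is reconstruction error."""
    return [codebook[i] for i in ids]


# Toy 4-entry codebook and three 2-D "patch features" (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, 0.2), (0.9, 0.1), (0.6, 0.8)]

ids = quantize(features, codebook)
print(ids)
print(dequantize(ids, codebook))
```

The round trip returns codebook entries rather than the original vectors, which is the same kind of information loss the limitations section attributes to the VQ tokenizer.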

Detailed Methodology of DeepSeek Janus Pro 7B

1. Architectural Overview

Figure: Architecture of DeepSeek Janus Pro 7B
Source: DeepSeek Janus Pro Paper

Janus-Pro adheres to an autoregressive framework with a decoupled visual encoding approach:

  • Multimodal Understanding: Features are flattened from a 2D grid into a 1D sequence. An adaptor then maps these features into the input space of the LLM.
  • Visual Generation: The VQ tokenizer converts images into discrete IDs. These IDs are flattened and mapped into the LLM’s input space using a generation adaptor.
  • Unified Processing: The multimodal feature sequences are concatenated and processed by the LLM, with separate prediction heads for text and image outputs.
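
The flatten-and-concatenate step described above can be sketched with plain Python lists. The token strings, the helper names, and the 2x3 grid are placeholders chosen for illustration, and the adaptor that projects features into the LLM's input space is omitted.

```python
def flatten_grid(grid):
    """Flatten a 2D grid (rows of features) into a 1D sequence in row-major order."""
    return [cell for row in grid for cell in row]


def build_sequence(text_before, image_grid, text_after):
    """Concatenate text tokens and flattened image features into one sequence."""
    return text_before + flatten_grid(image_grid) + text_after


# A 2x3 grid of placeholder image features; strings stand in for feature vectors.
image_grid = [
    ["img(0,0)", "img(0,1)", "img(0,2)"],
    ["img(1,0)", "img(1,1)", "img(1,2)"],
]
sequence = build_sequence(["<user>"], image_grid, ["Describe", "the", "image", "."])

print(len(sequence))
print(sequence)
```

The resulting sequence has 11 entries: one leading text token, six image slots in row-major order, and four trailing text tokens.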

1. Understanding (Processing Images to Generate Text)

This module enables the model to analyze and describe images based on an input query.

How It Works:

  • Input: Image
    • The model takes an image as input.
  • Und. Encoder (Understanding Encoder)
    • Extracts important visual features from the image (such as objects, colors, and spatial relationships).
    • Converts the raw image into a compressed representation that the transformer can understand.
  • Text Tokenizer
    • If a language instruction is provided (e.g., “What is in this image?”), it is tokenized into a numerical format.
  • Auto-Regressive Transformer
    • Processes both image features and text tokens to generate a text response.
  • Text De-Tokenizer
    • Converts the model’s numerical output into human-readable text.

Example:
Input:
An image of a cat sitting on a table + “Describe the image.”
Output: “A small white cat is sitting on a wooden table.”
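
As a rough illustration of the tokenizer and de-tokenizer steps, the sketch below maps words to integer IDs and back. It is a toy whitespace tokenizer with made-up helper names, not the tokenizer that ships with Janus-Pro.

```python
def build_vocab(texts):
    """Assign an integer ID to every distinct whitespace-separated word."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def tokenize(text, vocab):
    """Convert text into a list of integer token IDs."""
    return [vocab[word] for word in text.split()]


def detokenize(ids, vocab):
    """Convert token IDs back into human-readable text."""
    id_to_word = {i: word for word, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)


prompt = "What is in this image ?"
vocab = build_vocab([prompt])
ids = tokenize(prompt, vocab)

print(ids)
print(detokenize(ids, vocab))
```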

2. Image Generation (Processing Text to Generate Images)

This module enables the model to create new images from textual descriptions.

How It Works:

  • Input: Language Instruction
    • A user provides a text prompt describing the desired image (e.g., “A futuristic city at night.”).
  • Text Tokenizer
    • The text input is tokenized into numerical format.
  • Auto-Regressive Transformer
    • Predicts the image as a sequence of discrete image tokens, one token at a time.
  • Gen. Encoder (Generation Encoder)
    • A VQ tokenizer that converts images into discrete IDs, so that image content is represented in the same token form the transformer predicts.
  • Image Decoder
    • Converts the predicted image tokens into the final image.

Example:
Input:
“A dragon flying over a castle at sunset.”
Output: AI-generated image of a dragon soaring above a medieval castle at sunset.
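
Token-by-token prediction can be sketched as a simple loop. The stub below stands in for the transformer and just draws pseudo-random IDs; the function names, the prompt IDs, the 4x4 grid, and the codebook size of 16 are all hypothetical values chosen to keep the example small.

```python
import random


def toy_next_token(prefix, codebook_size, rng):
    """Stand-in for the transformer: returns a pseudo-random next image token ID.

    A real model would condition on `prefix`; this stub ignores its contents.
    """
    return rng.randrange(codebook_size)


def generate_image_tokens(prompt_ids, grid_side, codebook_size, seed=0):
    """Autoregressively extend the prompt until a grid_side x grid_side grid is filled."""
    rng = random.Random(seed)
    sequence = list(prompt_ids)
    image_ids = []
    for _ in range(grid_side * grid_side):
        token = toy_next_token(sequence, codebook_size, rng)
        sequence.append(token)  # each new token becomes context for the next step
        image_ids.append(token)
    # Reshape the flat ID list into rows, ready to hand to an image decoder.
    return [image_ids[r * grid_side:(r + 1) * grid_side] for r in range(grid_side)]


grid = generate_image_tokens(prompt_ids=[101, 7, 42], grid_side=4, codebook_size=16)
print(len(grid), len(grid[0]))
```

In the real model the image decoder would then turn this grid of IDs into pixels; here the grid is simply the end of the sketch.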

3. Key Components in the Model

Component                     Function
Und. Encoder                  Extracts visual features from input images.
Text Tokenizer                Converts text input into tokens for processing.
Auto-Regressive Transformer   Central module that handles both text and image generation sequentially.
Gen. Encoder                  VQ tokenizer that converts images into discrete token IDs.
Image Decoder                 Produces an image from the generated image tokens.
Text De-Tokenizer             Converts generated text tokens into human-readable responses.

4. Why This Architecture?

  • Unified Transformer Model: Uses the same transformer to process both images and text.
  • Sequential Generation: Outputs are generated step-by-step for both images and text.
  • Multi-Modal Learning: Can understand and generate images and text in a single system.

The DeepSeek Janus-Pro model is a powerful vision-language AI system that enables both image comprehension and text-to-image generation. By leveraging auto-regressive learning, it efficiently produces text and images in a structured and scalable manner. 🚀

2. Training Strategy Enhancements

Janus-Pro modifies the three-stage training pipeline:

  • Stage I: Focuses on ImageNet-based pretraining with extended training time.
  • Stage II: Discards ImageNet data in favor of dense text-to-image datasets, improving computational efficiency.
  • Stage III: Adjusts dataset ratios to balance multimodal, text, and text-to-image data.

3. Implementation Efficiency

Janus-Pro utilizes the HAI-LLM framework, leveraging NVIDIA A100 GPUs for distributed training. The entire training process is streamlined, taking 7 days for the 1.5B model and 14 days for the 7B model across multiple nodes.

Experimental Results

Janus-Pro demonstrates significant advancements over previous models:

  • Convergence Speed: Scaling to 7B parameters significantly reduces convergence time for multimodal understanding and visual generation tasks.
  • Improved Visual Generation: Synthetic data enhances text-to-image stability and aesthetics, though fine details (e.g., small facial features) remain challenging due to resolution limitations.
  • Enhanced Multimodal Understanding: Expanded datasets and a refined training strategy improve the model’s ability to comprehend and generate meaningful multimodal outputs.

Models in the Janus Series:

Model            Sequence Length   Download
Janus-1.3B       4096              🤗 Hugging Face
JanusFlow-1.3B   4096              🤗 Hugging Face
Janus-Pro-1B     4096              🤗 Hugging Face
Janus-Pro-7B     4096              🤗 Hugging Face

How to Access DeepSeek Janus Pro 7B?

First, save the required Python libraries and dependencies as requirements.txt in Google Colab, and then run this in a cell:

pip install -r /content/requirements.txt
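
Before loading the model, it can help to confirm that the installation worked. The sketch below checks whether the top-level modules imported by the inference script are importable; the helper name `missing_modules` is an assumption made for this example.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from `module_names` that cannot be imported here."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level modules imported by the inference script in this article.
required = ["torch", "transformers", "janus", "PIL"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported missing, re-run the install step before continuing.
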
After installing the required libraries, use the code below:

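import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

# specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"

# example inputs: replace with your own image path and question
# (the path below is a placeholder, not a file shipped with the model)
image = "/content/input_image.png"
question = "Describe the image."
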
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

Refer to this for full code with Gradio: deepseek-ai/Janus-Pro-7B
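
Since the script above casts the model to bfloat16 before moving it to the GPU, a quick back-of-the-envelope estimate shows roughly how much memory the weights alone need. The parameter count takes the "7B" in the model name at face value, which is an assumption; activations and other runtime overhead come on top of this figure.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


NOMINAL_PARAMS = 7_000_000_000  # "7B" taken at face value; the exact count may differ

bf16 = weight_memory_gib(NOMINAL_PARAMS, 2)  # bfloat16 stores 2 bytes per parameter
fp32 = weight_memory_gib(NOMINAL_PARAMS, 4)  # float32 stores 4 bytes per parameter

print(f"bfloat16 weights: ~{bf16:.1f} GiB")
print(f"float32 weights:  ~{fp32:.1f} GiB")
```

Under these assumptions the bfloat16 weights take about 13 GiB, so it is worth checking the memory of the GPU runtime you select before loading the model.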

Figure: Input image

Output

The image contains a logo with a stylized design that includes a circular
pattern resembling a target or a camera aperture. Within this design, there
is a cartoon character with sunglasses and a hand gesture, which appears to
be a playful or humorous representation.

The text next to the logo reads "License to Call." This suggests that the
image is likely related to a service or product that involves calling or
communication, possibly with a focus on licensing or authorization.

The overall design and text imply that the service or product is related to
communication, possibly involving a license or authorization process.

Outputs of DeepSeek Janus Pro 7B

Image Description

DeepSeek Janus-Pro produces an impressive and human-like description with excellent structure, vivid imagery, and strong coherence. Minor refinements could make it even more concise and precise.

Text Recognition

The text recognition output is accurate, clear, and well-structured, effectively capturing the main heading. However, it misses smaller text details and could mention the stylized typography for a richer description. Overall, it’s a strong response but could be improved with more completeness and visual insights.

Text-To-Image Generation

A strong and diverse text-to-image generation output with accurate visuals and descriptive clarity. A few refinements, such as fixing text cut-offs and adding finer details, could elevate the quality further.

Limitations and Future Directions

Despite its successes, Janus-Pro has certain limitations:

  1. Resolution Constraints: The 384 × 384 resolution restricts performance in fine-grained tasks like OCR or detailed image generation.
  2. Reconstruction Loss: The use of the VQ tokenizer introduces reconstruction losses, leading to under-detailed outputs in smaller image regions.
  3. Text-to-Image Challenges: While stability and aesthetics have improved, achieving ultra-high fidelity in generated images remains an ongoing challenge.

Future work could focus on:

  • Increasing image resolution to address fine detail limitations.
  • Exploring alternative tokenization methods to reduce reconstruction losses.
  • Enhancing the training pipeline with adaptive methods for diverse tasks.
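
A short calculation shows why raising the resolution is not free for an autoregressive model: the number of image tokens grows with the square of the image side. The sketch assumes one token per 16x16 patch, which is an assumption for illustration rather than a figure stated in this article, and compares the result with the 4096-token sequence length listed in the model table.

```python
def image_token_count(image_size, patch_size=16):
    """Number of tokens for a square image split into patch_size x patch_size cells."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


SEQUENCE_LENGTH = 4096  # from the model table in this article

for size in (384, 768, 1024):
    tokens = image_token_count(size)
    share = tokens / SEQUENCE_LENGTH
    print(f"{size}x{size}: {tokens} image tokens ({share:.0%} of a {SEQUENCE_LENGTH}-token sequence)")
```

Under this assumption a 384 × 384 image needs 576 tokens, while a 1024 × 1024 image would need 4096, filling the entire sequence on its own.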

Conclusion

Janus-Pro marks a transformative step in multimodal AI. By optimizing training strategies, scaling data, and expanding model size, it achieves state-of-the-art results in multimodal understanding and text-to-image generation. Despite some limitations, Janus-Pro lays a strong foundation for future research in scalable, efficient multimodal AI systems. Its advancements highlight the growing potential of AI to bridge the gap between vision and language, inspiring further innovation in the field.

Hi, I am Pankaj Singh Negi - Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
