Generative foundation models have revolutionized Natural Language Processing (NLP), with Large Language Models (LLMs) excelling across diverse tasks. However, the field of visual generation still lacks a unified model capable of handling multiple tasks within a single framework. Existing models like Stable Diffusion, DALL-E, and Imagen excel in specific domains but rely on task-specific extensions such as ControlNet or InstructPix2Pix, which limit their versatility and scalability.
OmniGen addresses this gap by introducing a unified framework for image generation. Unlike traditional diffusion models, OmniGen features a concise architecture comprising only a Variational Autoencoder (VAE) and a transformer model, eliminating the need for external task-specific components. This design allows OmniGen to handle arbitrarily interleaved text and image inputs, enabling a wide range of tasks such as text-to-image generation, image editing, and controllable generation within a single model.
OmniGen not only excels in benchmarks for text-to-image generation but also demonstrates robust transfer learning, emerging capabilities, and reasoning across unseen tasks and domains.
In this section, we will look into the OmniGen framework, focusing on its model design principles, architecture, and innovative training strategies.
Current diffusion models often face limitations that restrict their usability to a single task, such as text-to-image generation. Extending their functionality usually requires bolting on additional task-specific networks, which are cumbersome and cannot be reused across tasks. OmniGen addresses these challenges by adhering to two core design principles: universality, meaning a single model accepts free-form text and image inputs and handles a broad range of generation tasks, and conciseness, meaning the architecture avoids task-specific add-on modules.
OmniGen adopts a streamlined architecture built from just two components: a Variational Autoencoder (VAE) that encodes input images into latent tokens, and a large pre-trained transformer (initialized from the Phi-3 language model) that models the full interleaved sequence of text and image tokens.
Unlike conventional diffusion models that rely on separate encoders (e.g., CLIP or image encoders) for preprocessing input conditions, OmniGen inherently encodes all conditional information, significantly simplifying the pipeline. It also jointly models text and images within a single framework, enhancing interaction between modalities.
OmniGen accepts free-form multimodal prompts that interleave text and images: text is tokenized as usual, while each input image is encoded by the VAE and referenced in the prompt through an <img><|image_i|></img> placeholder pointing to the corresponding entry in the list of input images, as the short sketch below illustrates.
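The following minimal sketch illustrates this prompt format using the same placeholder convention as the full examples later in this article; the image paths and the prompt itself are hypothetical, and the pipeline call simply mirrors the API shown further below.

from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Each <img><|image_i|></img> placeholder refers to the i-th entry of input_images.
prompt = (
    "The person in <img><|image_1|></img> is standing in the park shown in "
    "<img><|image_2|></img>."
)
input_images = ["./person.png", "./park.png"]  # hypothetical example paths

images = pipe(
    prompt=prompt,
    input_images=input_images,
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
)
images[0].save("interleaved_example.png")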
The attention mechanism determines which parts of the input sequence each token can draw on, and it is central to how the transformer relates text tokens to image tokens within a single interleaved sequence.
OmniGen modifies the standard causal attention mechanism to better suit image modeling: attention remains causal across the elements of the sequence, but tokens belonging to the same image attend to one another bidirectionally, so each image is modeled as a coherent whole while still conditioning only on what precedes it.
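To make the idea concrete, here is a small illustrative sketch, not OmniGen's actual implementation, of how such a mask could be constructed; segment_ids is an assumed convention marking which positions belong to which input image.

import torch

def build_interleaved_attention_mask(segment_ids):
    # segment_ids[i] == 0 for text tokens, or k > 0 for tokens of the k-th image.
    # Returns a boolean (seq_len, seq_len) mask where True means "may attend".
    ids = torch.tensor(segment_ids)
    seq_len = ids.numel()
    # Standard causal mask: each token attends to itself and to earlier tokens.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Tokens of the same image additionally attend to each other bidirectionally.
    same_image = (ids.unsqueeze(0) == ids.unsqueeze(1)) & (ids.unsqueeze(0) > 0)
    return causal | same_image

# Example: 3 text tokens, a 4-token image, then 2 more text tokens.
mask = build_interleaved_attention_mask([0, 0, 0, 1, 1, 1, 1, 0, 0])
print(mask.int())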
Inference is the stage where the trained model is actually put to work: starting from random noise, it produces an image that satisfies the given text and image conditions.
OmniGen uses a flow-matching method for inference: starting from Gaussian noise, the model repeatedly predicts a velocity and integrates it over a fixed number of steps until the final latent is reached, which the VAE then decodes into an image. A key-value cache over the conditioning tokens is reused across steps to speed up sampling.
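Conceptually, this sampling loop can be sketched as a plain Euler integration of the predicted velocity. The snippet below is an illustrative sketch under that assumption, not OmniGen's actual inference code; velocity_model and cond are hypothetical placeholders for the transformer and the encoded conditions.

import torch

@torch.no_grad()
def rectified_flow_sample(velocity_model, cond, latent_shape, num_steps=50, device="cuda"):
    # Start from pure Gaussian noise at t = 0.
    x = torch.randn(latent_shape, device=device)
    timesteps = torch.linspace(0.0, 1.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = timesteps[i], timesteps[i + 1]
        # The model is assumed to predict the velocity v = x_data - x_noise at time t.
        v = velocity_model(x, t, cond)
        # Euler step toward the data distribution.
        x = x + (t_next - t) * v
    return x  # final latent, to be decoded into pixels by the VAE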
OmniGen employs the rectified flow approach for optimization, which differs from traditional DDPM training. It interpolates linearly between noise and data and trains the model to directly regress the target velocity given the noised data, the timestep, and the conditioning information.
The training objective minimizes a weighted mean squared error loss, emphasizing regions where changes occur in image editing tasks to prevent the model from overfitting to unchanged areas.
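A minimal sketch of this training objective, assuming x1 is the clean latent, cond the conditioning information, and edit_weight_mask an optional per-pixel weight that emphasizes edited regions; this illustrates the loss described above and is not the authors' training code.

import torch

def rectified_flow_loss(model, x1, cond, edit_weight_mask=None):
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)   # timestep drawn uniformly from [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))        # reshape for broadcasting
    xt = t_ * x1 + (1 - t_) * x0                    # linear interpolation between noise and data
    target_v = x1 - x0                              # target velocity
    pred_v = model(xt, t, cond)                     # model prediction
    loss = (pred_v - target_v) ** 2                 # squared error
    if edit_weight_mask is not None:
        loss = loss * edit_weight_mask              # up-weight regions that actually change
    return loss.mean()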
OmniGen progressively trains at increasing image resolutions, balancing data efficiency with aesthetic quality.
Training details, including resolution, steps, batch size, and learning rate, are outlined below:
Stage | Image Resolution | Training Steps (K) | Batch Size | Learning Rate
1 | 256×256 | 500 | 1040 | 1e-4
2 | 512×512 | 300 | 520 | 1e-4
3 | 1024×1024 | 100 | 208 | 4e-5
4 | 2240×2240 | 30 | 104 | 2e-5
5 | Multiple | 80 | 104 | 2e-5
Through its innovative architecture and efficient training methodology, OmniGen sets a new benchmark in diffusion models, enabling versatile and high-quality image generation for a wide range of applications.
To enable robust multi-task processing in image generation, a large-scale and diverse training corpus was essential. OmniGen builds this foundation with the X2I ("anything-to-image") dataset, which converts a wide range of tasks, including text-to-image generation, image editing, controllable and subject-driven generation, and classic computer-vision tasks, into a single unified input-output format suitable for multi-task training.
Through these advancements, OmniGen sets a benchmark for achieving unified and intelligent image generation capabilities, bridging gaps between diverse tasks and paving the way for groundbreaking applications.
OmniGen is easy to get started with, whether you’re working in a local environment or using Google Colab. Follow the instructions below to install and use OmniGen for generating images from text or multi-modal inputs.
To install OmniGen, start by cloning the GitHub repository and installing the package:
Clone the OmniGen repository and install it in editable mode:
git clone https://github.com/VectorSpaceLab/OmniGen.git
cd OmniGen
pip install -e .
Alternatively, install the released package directly from PyPI:
pip install OmniGen
Optional: If you prefer to avoid conflicts, create a dedicated environment:
# Create a Python 3.10.13 conda environment (you can also use virtualenv)
conda create -n omnigen python=3.10.13
conda activate omnigen
# Install PyTorch with the appropriate CUDA version (e.g., cu118)
pip install torch==2.3.1+cu118 torchvision --extra-index-url https://download.pytorch.org/whl/cu118
# Clone and install OmniGen
git clone https://github.com/VectorSpaceLab/OmniGen.git
cd OmniGen
pip install -e .
If you are working in Google Colab instead, a single notebook cell is enough to install the released package:
!pip install OmniGen
Once OmniGen is installed, you can start generating images. Below are examples of how to use the OmniGen pipeline.
OmniGen allows you to generate images from text prompts. Here’s a simple example that generates a realistic photo of a young woman sitting on a sofa:
from OmniGen import OmniGenPipeline
pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")
# Generate an image from text
images = pipe(
prompt='''Realistic photo. A young woman sits on a sofa,
holding a book and facing the camera. She wears delicate
silver hoop earrings adorned with tiny, sparkling diamonds
that catch the light, with her long chestnut hair cascading
over her shoulders. Her eyes are focused and gentle, framed
by long, dark lashes. She is dressed in a cozy cream sweater,
which complements her warm, inviting smile. Behind her, there
is a table with a cup of water in a sleek, minimalist blue mug.
The background is a serene indoor setting with soft natural light
filtering through a window, adorned with tasteful art and flowers,
creating a cozy and peaceful ambiance. 4K, HD''',
height=1024,
width=1024,
guidance_scale=2.5,
seed=0,
)
images[0].save("example_t2i.png") # Save the generated image
images[0].show()
You can also use OmniGen for multi-modal generation, where text and images are combined. Here’s an example where an image is included as part of the input:
# Generate an image with text and a provided image
images = pipe(
prompt="<img><|image_1|><img>\n Remove the woman's earrings. Replace the mug with a clear glass filled with sparkling iced cola.
.",
input_images=["./imgs/demo_cases/edit.png
"],
height=1024,
width=1024,
guidance_scale=2.5,
img_guidance_scale=1.6,
seed=0
)
images[0].save("example_ti2i.png") # Save the generated image
The following example demonstrates OmniGen’s advanced Computer Vision (CV) capabilities, specifically its ability to detect and render the human skeleton from an image input. This task combines textual instructions with an image to produce accurate skeleton detection results.
from PIL import Image
# Define the prompt for skeleton detection
prompt = "Detect the skeleton of human in this image: <img><|image_1|><img>"
input_images = ["./imgs/demo_cases/edit.png"]
# Generate the output image with skeleton detection
images = pipe(
prompt=prompt,
input_images=input_images,
height=1024,
width=1024,
guidance_scale=2,
img_guidance_scale=1.6,
seed=333
)
# Save and display the output
images[0].save("./imgs/demo_cases/skeletal.png")
# Display the input image
print("Input Image:")
for img in input_images:
Image.open(img).show()
# Display the output image
print("Output:")
images[0].show()
This example demonstrates OmniGen’s subject-driven ability to identify individuals described in a prompt from multiple input images and generate a group image of these subjects. The process is end-to-end, requiring no external recognition or segmentation, showcasing OmniGen’s flexibility in handling complex multi-source scenarios.
from PIL import Image
# Define the prompt for subject-driven generation
prompt = (
"A professor and a boy are reading a book together. "
"The professor is the middle man in <img><|image_1|></img>. "
"The boy is the boy holding a book in <img><|image_2|></img>."
)
input_images = ["./imgs/demo_cases/AI_Pioneers.jpg", "./imgs/demo_cases/same_pose.png"]
# Generate the output image with described subjects
images = pipe(
prompt=prompt,
input_images=input_images,
height=1024,
width=1024,
guidance_scale=2.5,
img_guidance_scale=1.6,
separate_cfg_infer=True,
seed=0
)
# Save and display the generated image
images[0].save("./imgs/demo_cases/entity.png")
# Display input images
print("Input Images:")
for img in input_images:
Image.open(img).show()
# Display the output image
print("Output:")
images[0].show()
Subject-Driven Ability: OmniGen can identify the described subjects in multi-person images and generate a group image of individuals drawn from multiple sources. This end-to-end process requires no additional recognition or segmentation step, highlighting the model’s flexibility and versatility.
The versatility of OmniGen opens up numerous applications across different fields, from generative art and creative design to image editing workflows and data augmentation for training other vision models.
As OmniGen continues to evolve, future iterations may expand its capabilities further, potentially incorporating more advanced reasoning mechanisms and enhancing its performance on complex tasks.
OmniGen is a revolutionary image generation model that combines text and image inputs into a unified framework, overcoming the limitations of existing models like Stable Diffusion and DALL-E. By integrating a Variational Autoencoder (VAE) and a transformer model, it simplifies workflows while enabling versatile tasks such as text-to-image generation and image editing. With capabilities like multi-modal generation, subject-driven customization, and few-shot learning, OmniGen opens new possibilities in fields like generative art and data augmentation. Despite some limitations, such as challenges with long text inputs and fine details, OmniGen is set to shape the future of visual content creation, offering a powerful, flexible tool for diverse applications.
Q. What is OmniGen?
A. OmniGen is a unified image generation model designed to handle a variety of tasks, including text-to-image generation, image editing, and multi-modal generation (combining text and images). Unlike traditional models, OmniGen does not rely on task-specific extensions, offering a more versatile and scalable solution.
Q. How is OmniGen different from existing image generation models?
A. OmniGen stands out due to its simple architecture, which combines a Variational Autoencoder (VAE) and a transformer model. This allows it to process both text and image inputs in a unified framework, enabling a wide range of tasks without requiring additional components or modifications.
Q. What hardware is needed to run OmniGen?
A. To run OmniGen efficiently, a system with a CUDA-enabled GPU is recommended. The model was trained on A800 GPUs, and inference benefits from GPU acceleration together with a key-value cache mechanism.