The recent release of the Flux model by Black Forest Labs drew wide attention for its remarkable image-generation quality. However, the model is large, and at full precision it cannot run on a typical end-user or free-tier machine. That has pushed most people toward hosted API services, where the model is never loaded locally, while organizations that prefer to self-host face the high cost of large GPUs. Thankfully, the Hugging Face team has added BitsAndBytes quantization support to the Diffusers library, which means we can now run Flux inference on a machine with 8GB of GPU RAM.
Flux is a family of advanced text-to-image and image-to-image models from Black Forest Labs, a team founded by the original creators of Stable Diffusion. It can be viewed as the next step in text-to-image model development: a successor to Stable Diffusion that incorporates state-of-the-art techniques and improves on it in both performance and output quality.
As mentioned in the introduction, Flux can be quite expensive to run on consumer hardware, but users with limited GPU memory can apply optimizations to run it in a more memory-friendly way. In this article, we will see how Flux benefits from quantization with BitsAndBytes, much like large language models are shrunk into quantized GGUF files.
Flux comes in two major distilled variants, Timestep-distilled (FLUX.1 [schnell]) and Guidance-distilled (FLUX.1 [dev]), and its architecture is built upon several advanced components, including two pre-trained text encoders (CLIP and T5) and a large diffusion-transformer backbone. These features allow Flux to outperform many of its predecessors with a more refined and flexible image-generation experience.
If you’re familiar with running large language models (LLMs) locally, you may have encountered quantization before. Although less commonly used for image models, quantization is a powerful technique that reduces a model’s size by storing its parameters in fewer bits, shrinking the memory footprint with little loss in output quality. Typically, neural network parameters are stored in 32 bits (full precision), but quantization can reduce this to as few as 4 bits. This reduction in precision is what enables a large model like Flux to run on consumer-grade hardware.
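To make the savings concrete, here is a rough back-of-the-envelope estimate for a model of roughly Flux's scale (the Flux transformer has on the order of 12 billion parameters; the figures are illustrative only and ignore activations, the VAE, and the text encoders):

# Rough weight-only memory footprint of a ~12B-parameter model at different precisions.
params = 12e9  # approximate parameter count of the Flux transformer

for bits, label in [(32, "fp32"), (16, "fp16"), (8, "int8"), (4, "nf4")]:
    gb = params * bits / 8 / 1024**3
    print(f"{label:>5}: ~{gb:.1f} GB for the weights alone")

At 4 bits, the transformer's weights drop from roughly 45 GB in full precision to around 5-6 GB, which is what makes an 8GB GPU feasible.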
One key innovation that makes running Flux on an 8GB GPU possible is quantization, powered by the BitsAndBytes library. This library enables accessible large language models via k-bit quantization for PyTorch, offering three main features that dramatically reduce memory consumption for inference and training.
The Diffusers library, which powers image generation models like Flux, recently added support for this quantization technique. As a result, you can now generate complex images directly on your laptop or platforms like Google Colab’s free tier using just 8GB of GPU RAM.
BitsAndBytes is the go-to option for quantizing models to 8-bit and 4-bit precision. In the 8-bit scheme, outlier features are multiplied in fp16 while the remaining values are multiplied in int8; the int8 results are then dequantized back to fp16, and the two parts are added together to produce the final fp16 output. This approach minimizes the degradative effect that outlier values have on a model’s performance. The 4-bit quantization compresses the model even further and is commonly used with QLoRA to fine-tune quantized LLMs.
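As a quick illustration (once the libraries installed below are available), these two schemes map onto transformers' BitsAndBytesConfig roughly as follows; we rely on the 4-bit NF4 variant later in this guide:

# Sketch: expressing the 8-bit and 4-bit schemes with transformers' BitsAndBytesConfig.
import torch
from transformers import BitsAndBytesConfig

int8_config = BitsAndBytesConfig(load_in_8bit=True)  # mixed int8/fp16 (LLM.int8()-style)
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat, as used by QLoRA
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
)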
In this guide, we’ll show how you can load and run Flux using 4-bit quantization, drastically reducing memory requirements.
To get started, ensure that your machine has a GPU-enabled environment (such as an NVIDIA T4 or L4 GPU). Let’s dive into the technical steps of running Flux on a machine with only 8GB of GPU memory (your free Google Colab will do!).
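You can quickly confirm that a GPU is actually visible before going further; a small check like this (nothing here is specific to Flux) saves debugging time later:

# Sketch: confirm a CUDA GPU is available and report its memory.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected -- switch the Colab runtime to a GPU before continuing.")

Next, install the latest Diffusers, Transformers, and BitsAndBytes directly from source: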
!pip install -Uq git+https://github.com/huggingface/diffusers@main
!pip install -Uq git+https://github.com/huggingface/transformers@main
!pip install -Uq bitsandbytes
These packages provide all the tools needed to run Flux memory-efficiently: loading the pre-trained text encoders, efficient model loading with CPU offloading, and quantization for running large models on smaller hardware. Next, we import the dependencies.
import diffusers
import transformers
import bitsandbytes as bnb
from diffusers import FluxPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel
import torch
import gc
We need all the memory we have. To ensure smooth operation and avoid memory waste, we define a function that clears the GPU memory between model loads. The function below will flush the GPU’s cache and reset memory statistics, ensuring optimal resource usage throughout the notebook.
def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()

def bytes_to_giga_bytes(bytes):
    return bytes / 1024 / 1024 / 1024

flush()
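The bytes_to_giga_bytes helper is not called in the main walkthrough, but it is handy for checking memory consumption at any point; for example:

# Sketch: report how much GPU memory is currently allocated.
print(f"Currently allocated: {bytes_to_giga_bytes(torch.cuda.memory_allocated()):.2f} GB")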
Flux uses two pre-trained text encoders: CLIP and T5. To minimise memory usage, we load the much larger T5 encoder in 4-bit precision, which cuts the memory it needs by almost 90% compared to full precision.
# Checkpoints
ckpt_id = "black-forest-labs/FLUX.1-dev"
ckpt_4bit_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
prompt = "a cute dog in paris photoshoot"

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    ckpt_4bit_id,
    subfolder="text_encoder_2",
)
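If you would rather not depend on the pre-quantized test repository, a recent transformers release can also quantize the T5 encoder on the fly from the main checkpoint. Here is a minimal sketch, assuming the NF4 settings mirror those of the pre-quantized checkpoint:

# Alternative (sketch): quantize the T5 encoder on the fly with NF4.
import torch
from transformers import BitsAndBytesConfig, T5EncoderModel

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=nf4_config,
    torch_dtype=torch.float16,
)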
With the 4-bit quantized T5 encoder loaded, we can now encode the text prompt. This converts the input prompt into embeddings that will later guide the image-generation process, while keeping memory consumption low enough for a resource-constrained machine.
Next, we load a stripped-down Flux pipeline containing only the text encoders (the small CLIP encoder in fp16 plus our 4-bit T5), leaving the transformer and VAE out for now. CPU offloading, which balances memory usage by moving parameters that don’t fit in GPU memory onto the CPU, will be enabled later for the generation pipeline.
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_2_4bit,
    transformer=None,
    vae=None,
    torch_dtype=torch.float16,
)

with torch.no_grad():
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=None, max_sequence_length=256
    )

del pipeline
flush()
After encoding, the prompt is stored as prompt_embeds (with the pooled CLIP representation in pooled_prompt_embeds), which will condition the model during image generation. Since the text encoders are no longer needed, we delete the pipeline and flush the GPU memory.
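If you want a quick sanity check before moving on, the embedding tensors can be inspected directly (the exact shapes depend on the settings above, e.g. the max_sequence_length of 256):

# Sketch: inspect the encoded prompt tensors.
print("prompt_embeds:", prompt_embeds.shape, prompt_embeds.dtype)
print("pooled_prompt_embeds:", pooled_prompt_embeds.shape, pooled_prompt_embeds.dtype)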
With the text embeddings ready, we now load the remaining parts of the model: the transformer and the VAE. The transformer is loaded from the 4-bit checkpoint, while the comparatively small VAE is loaded in fp16, keeping the overall memory footprint minimal.
transformer_4bit = FluxTransformer2DModel.from_pretrained(ckpt_4bit_id, subfolder="transformer")

pipeline = FluxPipeline.from_pretrained(
    ckpt_id,
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
)

pipeline.enable_model_cpu_offload()
This step completes the loading of the model, and you’re ready to generate images on an 8GB machine.
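As with the text encoder, recent Diffusers releases can also quantize the transformer on the fly rather than relying on the pre-quantized test checkpoint. A minimal sketch, assuming your installed diffusers version exposes BitsAndBytesConfig:

# Alternative (sketch): quantize the Flux transformer on the fly with NF4.
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    torch_dtype=torch.float16,
)

Either way, the quantized transformer plugs into the pipeline exactly as shown above.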
print("Running denoising.")
height, width = 512, 768
images = pipeline(
prompt_embeds=prompt_embeds,
pooled_prompt_embeds=pooled_prompt_embeds,
num_inference_steps=50,
guidance_scale=5.5,
height=height,
width=width,
output_type="pil",
).images
# Display the image
images[0]
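You will probably also want to persist the result and see how much GPU memory the full run actually needed; a short example (the filename is just a placeholder):

# Save the generated image and report peak GPU memory for the run.
images[0].save("flux_output.png")
print(f"Peak GPU memory: {bytes_to_giga_bytes(torch.cuda.max_memory_allocated()):.2f} GB")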
This combination of quantization and efficient model handling brings us closer to a future where powerful AI models run directly on consumer hardware. You no longer need high-end GPUs, expensive cloud resources, or paid serverless API calls. With improvements in the underlying technology and quantization techniques like BitsAndBytes, the possibilities for democratized AI keep growing. Whether you are a hobbyist, developer, or researcher, these advancements make it easier than ever to create, experiment, and innovate in image generation.
With the introduction of Flux and the clever use of quantization, you can now generate impressive images on hardware as modest as an 8GB GPU. While full-precision models demand far more memory and compute, techniques such as 4-bit quantization provide a practical way to deploy large models on constrained systems, and the same approach applies to other large models beyond Flux, opening up high-quality AI generation on smaller, more affordable hardware. This is a significant step toward making advanced AI accessible to a broader audience, and the technology will only get better from here. So grab your laptop, set up Flux, and start creating!
Frequently Asked Questions

Q1. Why use 4-bit quantization to run FLUX?
Ans. 4-bit quantization reduces the model’s memory footprint, allowing large models like FLUX to run more efficiently on limited resources, such as Colab GPUs.

Q2. How do I generate a different image?
Ans. Simply replace the prompt variable in the script with any new text description you want the model to visualize. For example, changing it to “A serene landscape with mountains” will generate an image of that scene.

Q3. How do I control the quality and detail of the output?
Ans. You can adjust num_inference_steps (controls the quality) and guidance_scale (controls how strongly the image adheres to the prompt) in the pipeline call. Higher values will result in better quality and more detailed images, but they may also take more time to generate.

Q4. What should I do if I run out of GPU memory?
Ans. Ensure that you’re running the notebook on a GPU and using the 4-bit quantization and mixed-precision setup. If errors persist, consider lowering num_inference_steps or using a more aggressive CPU-offload mode (see the sketch after this FAQ) to reduce memory usage.

Q5. Can I run this script on my own machine instead of Colab?
Ans. Yes, you can run this script on any machine that has Python and the required libraries installed. Ensure that your local machine has sufficient GPU resources and memory if you’re working with large models like FLUX.
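For reference, here is a minimal sketch of that more aggressive offloading mode. It replaces the enable_model_cpu_offload() call from the walkthrough and moves submodules to the GPU one at a time, trading speed for a lower VRAM peak (assumes a recent diffusers version):

# Sketch: sequential CPU offload, a lower-memory (but slower) alternative.
pipeline.enable_sequential_cpu_offload()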