ByteDance, the company behind TikTok, continues to make waves in the AI community, not just for its social media platform but also for its latest research in video generation. After impressing the tech world with their OmniHuman paper, they’ve now released another video generation paper called Goku. Goku is a family of AI models that makes creating stunning, realistic videos and images as simple as typing a few words. Let’s dive deeper into what makes this model special.
Current image and video generation models, while impressive, still face several limitations that Goku aims to address:
Goku aims to overcome these limitations by focusing on data curation, rectified flow Transformers, and scalable training infrastructure, ultimately pushing the boundaries of what’s possible in joint image and video generation.
Goku is a new family of joint image-and-video generation models based on rectified flow Transformers, designed to achieve industry-grade performance. It integrates advanced techniques for high-quality visual generation, including meticulous data curation, careful model design, and flow formulation. At its core is a rectified flow (RF) Transformer built specifically for generating images and videos jointly, which converges faster than comparable diffusion models.
Key contributions of Goku include:
Goku supports multiple generation tasks, such as text-to-video, image-to-video, and text-to-image generation. It achieves top scores on major benchmarks: 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video, a score that placed Goku-T2V at the No. 2 position on the VBench leaderboard as of 2024-10-07.
Goku is trained in multiple stages and operates using a sophisticated Rectified Flow technology to generate high-quality images and videos.
Training Stages:
Goku operates using Rectified Flow technology to enhance AI-generated visuals by making movements more natural and fluid. Unlike traditional models that correct frames step by step (leading to jerky animations), Goku processes entire sequences to ensure continuous, seamless movement.
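The core idea behind rectified flow can be sketched in a few lines: training pairs a noise sample with a data sample along a straight-line path and regresses the constant velocity between them, while sampling integrates that learned velocity field from noise to data. The sketch below is a minimal, generic illustration of this formulation, not Goku's actual implementation; all function names here are illustrative.

```python
import numpy as np

def rf_interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to data x1 at time t in [0, 1]."""
    return (1.0 - t) * x0 + t * x1

def rf_velocity_target(x0, x1):
    """Regression target for the model: d x_t / dt, constant along the straight path."""
    return x1 - x0

def euler_sample(velocity_fn, x0, num_steps=10):
    """Generate a sample by integrating dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x
```

Because the target path is a straight line, the trajectories the sampler follows are nearly straight as well, which is what allows fewer integration steps and faster convergence than curved diffusion trajectories.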
Additional Training Details:
Using advanced Rectified Flow technology, Goku transforms static images and text prompts into dynamic videos with smooth motion, offering content creators a powerful tool for automated video production.
An example prompt: “Two women are sitting at a table in a room with wooden walls and a plant in the background. Both women look to the right and talk, with surprised expressions.”
Goku is evaluated on text-to-image and text-to-video benchmarks:
Qualitative results demonstrate the superior quality of the generated media samples, underscoring Goku’s effectiveness in multi-modal generation and its potential as a high-performing solution for both research and commercial applications.
Goku achieves top scores on major benchmarks:
The Goku framework excels in transforming static images into dynamic video sequences through its Image-to-Video (I2V) capabilities. To achieve this, the Goku-I2V model undergoes fine-tuning from the Text-to-Video (T2V) initialization, utilizing a dataset of approximately 4.5 million text-image-video triplets sourced from diverse domains. This ensures robust generalization across a wide array of visual styles and semantic contexts.
Despite a relatively small number of fine-tuning steps (10,000), the model demonstrates remarkable efficiency in animating reference images. Crucially, the generated videos maintain strong alignment with the accompanying textual descriptions, effectively translating the semantic nuances into coherent visual narratives. The resulting videos exhibit high visual quality and impressive temporal coherence, showcasing Goku’s ability to breathe life into still images while adhering to textual cues.
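Conceptually, I2V fine-tuning reuses the same rectified-flow objective, with the reference image supplied as an extra conditioning signal alongside the text. The sketch below shows how one such training example might be assembled; the layout and conditioning scheme are assumptions for illustration, as the paper's exact mechanism is not reproduced here.

```python
import numpy as np

def i2v_training_example(ref_frame, video, rng):
    """Build one rectified-flow training example for image-to-video fine-tuning.

    ref_frame: (C, H, W) reference-image latent used as the conditioning signal.
    video:     (T, C, H, W) target video latents.
    Returns the noisy input, timestep, velocity target, and condition.
    (Hypothetical tensor layout; Goku's actual conditioning may differ.)
    """
    x1 = video
    x0 = rng.standard_normal(video.shape)   # Gaussian noise, same shape as the target
    t = rng.uniform()                       # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1           # straight-line interpolation
    v_target = x1 - x0                      # constant velocity target
    return x_t, t, v_target, ref_frame
```

Because the objective is unchanged from T2V training, only the conditioning pathway needs to adapt, which is consistent with the small number of fine-tuning steps (10,000) the paper reports.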
To provide an intuitive understanding of Goku’s performance, qualitative assessments were conducted, comparing its output with that of both open-source models (such as CogVideoX and Open-Sora-Plan) and closed-source commercial products (including DreamMachine, Pika, Vidu, and Kling). The results highlight Goku’s strengths in handling complex prompts and generating coherent video elements. While certain commercial models often struggle to accurately render details or maintain motion consistency, Goku-T2V (8B) consistently demonstrates superior performance. It excels at incorporating all details from the prompt, creating visual outputs with smooth motion and realistic dynamics.
Two key ablation studies were performed to understand the impact of model scaling and joint training on Goku’s performance:
By comparing Goku-T2V models with 2B and 8B parameters, it was found that increasing model size helps to mitigate the generation of distorted object structures. This observation aligns with findings from other large multi-modality models, indicating that increased capacity contributes to more accurate and realistic visual representations.
The impact of joint image-and-video training was assessed by fine-tuning Goku-T2V (8B) on 480p videos, both with and without joint image-and-video training, starting from the same pretrained Goku-T2I (8B) weights. The results demonstrated that Goku-T2V trained without joint training tended to generate lower-quality video frames. In contrast, the model with joint training more consistently produced photorealistic frames, highlighting the importance of this approach for achieving high visual fidelity in video generation.
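One common way to realize joint image-and-video training is to treat a still image as a one-frame video so that both modalities share a single objective, then mix them in each batch. The sketch below illustrates that idea only; whether Goku batches samples this way is an assumption, not something stated in the paper.

```python
import numpy as np

def as_video(sample):
    """Treat a still image (C, H, W) as a one-frame video (1, C, H, W),
    so images and videos can share one rectified-flow objective."""
    return sample[None] if sample.ndim == 3 else sample

def mixed_batch(images, videos):
    """Combine image and video samples into one joint training list.
    (Illustrative mixing scheme; Goku's actual batching strategy is not
    described here and may differ.)"""
    return [as_video(x) for x in images] + list(videos)
```

Under this view, the large image corpus directly supervises per-frame appearance, which is one plausible reading of why the jointly trained model produces more photorealistic frames in the ablation.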
Goku emerges as a powerful force in the landscape of generative AI, demonstrating the potential of rectified flow Transformers to bridge the gap between text and vivid visual realities. From its meticulously curated datasets to its scalable training infrastructure, every aspect of Goku is engineered for peak performance. While the journey of AI-driven content creation is far from over, Goku marks a significant leap forward, paving the way for more intuitive, accessible, and breathtakingly realistic visual experiences in the years to come. It’s not just about generating images and videos; it’s about unlocking new creative possibilities for everyone.
Q. What is Goku?
A. Goku is a family of joint image-and-video generation models leveraging rectified flow Transformers.
Q. What are Goku’s key components?
A. The key components are data curation, model architecture design, flow formulation, and training infrastructure optimization.
Q. On which benchmarks does Goku excel?
A. Goku excels on GenEval and DPG-Bench for text-to-image generation, and on VBench for text-to-video tasks.
Q. How large is Goku’s training dataset?
A. The training dataset comprises approximately 36M video-text pairs and 160M image-text pairs.
Q. What is rectified flow?
A. Rectified flow is a formulation used for joint image and video generation, implemented through the Goku model family.