ByteDance, the company behind TikTok, continues to make waves in the AI community, not just for its social media platform but also for its latest research in video generation. After impressing the tech world with their OmniHuman paper, they’ve now released another video generation paper called Goku. Goku is a family of AI models that makes creating stunning, realistic videos and images as simple as typing a few words. Let’s dive deeper into what makes this model special.
Current image and video generation models, while impressive, still face several limitations that Goku aims to address:
Goku aims to overcome these limitations by focusing on data curation, rectified flow Transformers, and scalable training infrastructure, ultimately pushing the boundaries of what’s possible in joint image and video generation.
Goku is a new family of joint image-and-video generation models based on rectified flow Transformers, designed to achieve industry-grade performance. It integrates advanced techniques for high-quality visual generation, including meticulous data curation, careful model design, and flow formulation. At its core is a rectified flow (RF) Transformer built specifically for generating images and videos jointly, which converges faster than comparable diffusion models.
Key contributions of Goku include:
Goku supports multiple generation tasks, such as text-to-video, image-to-video, and text-to-image generation. It achieves top scores on major benchmarks: 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video, a score that placed Goku-T2V at the No. 2 position on the VBench leaderboard as of 2024-10-07.
Goku is trained in multiple stages and operates using a sophisticated Rectified Flow technology to generate high-quality images and videos.
Training Stages:
Goku operates using Rectified Flow technology to enhance AI-generated visuals by making movements more natural and fluid. Unlike traditional models that correct frames step by step (leading to jerky animations), Goku processes entire sequences to ensure continuous, seamless movement.
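The core idea behind rectified flow can be sketched in a few lines: training pairs a noise sample with a data sample along a straight-line path and regresses the constant velocity between them, while sampling integrates that learned velocity field from noise to data. The sketch below is a minimal, generic illustration of this formulation, not Goku's actual implementation; all function names here are illustrative.

```python
import numpy as np

def rf_interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to data x1 at time t in [0, 1]."""
    return (1.0 - t) * x0 + t * x1

def rf_velocity_target(x0, x1):
    """Regression target for the model: d x_t / dt, constant along the straight path."""
    return x1 - x0

def euler_sample(velocity_fn, x0, num_steps=10):
    """Generate a sample by integrating dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x
```

Because the target path is a straight line, the trajectories the sampler follows are nearly straight as well, which is what allows fewer integration steps and faster convergence than curved diffusion trajectories.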
Additional Training Details:
Using advanced Rectified Flow technology, Goku transforms static images and text prompts into dynamic videos with smooth motion, offering content creators a powerful tool for automated video production.
An example prompt: “Two women are sitting at a table in a room with wooden walls and a plant in the background. Both women look to the right and talk, with surprised expressions.”
Goku is evaluated on text-to-image and text-to-video benchmarks:
Qualitative results demonstrate the superior quality of the generated media samples, underscoring Goku’s effectiveness in multi-modal generation and its potential as a high-performing solution for both research and commercial applications.
Goku achieves top scores on major benchmarks:
The Goku framework excels in transforming static images into dynamic video sequences through its Image-to-Video (I2V) capabilities. To achieve this, the Goku-I2V model undergoes fine-tuning from the Text-to-Video (T2V) initialization, utilizing a dataset of approximately 4.5 million text-image-video triplets sourced from diverse domains. This ensures robust generalization across a wide array of visual styles and semantic contexts.
Despite a relatively small number of fine-tuning steps (10,000), the model demonstrates remarkable efficiency in animating reference images. Crucially, the generated videos maintain strong alignment with the accompanying textual descriptions, effectively translating the semantic nuances into coherent visual narratives. The resulting videos exhibit high visual quality and impressive temporal coherence, showcasing Goku’s ability to breathe life into still images while adhering to textual cues.
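Conceptually, I2V fine-tuning reuses the same rectified-flow objective, with the reference image supplied as an extra conditioning signal alongside the text. The sketch below shows how one such training example might be assembled; the layout and conditioning scheme are assumptions for illustration, as the paper's exact mechanism is not reproduced here.

```python
import numpy as np

def i2v_training_example(ref_frame, video, rng):
    """Build one rectified-flow training example for image-to-video fine-tuning.

    ref_frame: (C, H, W) reference-image latent used as the conditioning signal.
    video:     (T, C, H, W) target video latents.
    Returns the noisy input, timestep, velocity target, and condition.
    (Hypothetical tensor layout; Goku's actual conditioning may differ.)
    """
    x1 = video
    x0 = rng.standard_normal(video.shape)   # Gaussian noise, same shape as the target
    t = rng.uniform()                       # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1           # straight-line interpolation
    v_target = x1 - x0                      # constant velocity target
    return x_t, t, v_target, ref_frame
```

Because the objective is unchanged from T2V training, only the conditioning pathway needs to adapt, which is consistent with the small number of fine-tuning steps (10,000) the paper reports.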
To provide an intuitive understanding of Goku’s performance, qualitative assessments were conducted, comparing its output with that of both open-source models (such as CogVideoX and Open-Sora-Plan) and closed-source commercial products (including DreamMachine, Pika, Vidu, and Kling). The results highlight Goku’s strengths in handling complex prompts and generating coherent video elements. While certain commercial models often struggle to accurately render details or maintain motion consistency, Goku-T2V (8B) consistently demonstrates superior performance. It excels at incorporating all details from the prompt, creating visual outputs with smooth motion and realistic dynamics.
Two key ablation studies were performed to understand the impact of model scaling and joint training on Goku’s performance:
By comparing Goku-T2V models with 2B and 8B parameters, it was found that increasing model size helps to mitigate the generation of distorted object structures. This observation aligns with findings from other large multi-modality models, indicating that increased capacity contributes to more accurate and realistic visual representations.
The impact of joint image-and-video training was assessed by fine-tuning Goku-T2V (8B) on 480p videos, both with and without joint image-and-video training, starting from the same pretrained Goku-T2I (8B) weights. The results demonstrated that Goku-T2V trained without joint training tended to generate lower-quality video frames. In contrast, the model with joint training more consistently produced photorealistic frames, highlighting the importance of this approach for achieving high visual fidelity in video generation.
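One common way to realize joint image-and-video training is to treat a still image as a one-frame video so that both modalities share a single objective, then mix them in each batch. The sketch below illustrates that idea only; whether Goku batches samples this way is an assumption, not something stated in the paper.

```python
import numpy as np

def as_video(sample):
    """Treat a still image (C, H, W) as a one-frame video (1, C, H, W),
    so images and videos can share one rectified-flow objective."""
    return sample[None] if sample.ndim == 3 else sample

def mixed_batch(images, videos):
    """Combine image and video samples into one joint training list.
    (Illustrative mixing scheme; Goku's actual batching strategy is not
    described here and may differ.)"""
    return [as_video(x) for x in images] + list(videos)
```

Under this view, the large image corpus directly supervises per-frame appearance, which is one plausible reading of why the jointly trained model produces more photorealistic frames in the ablation.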
Goku emerges as a powerful force in the landscape of generative AI, demonstrating the potential of rectified flow Transformers to bridge the gap between text and vivid visual realities. From its meticulously curated datasets to its scalable training infrastructure, every aspect of Goku is engineered for peak performance. While the journey of AI-driven content creation is far from over, Goku marks a significant leap forward, paving the way for more intuitive, accessible, and breathtakingly realistic visual experiences in the years to come. It’s not just about generating images and videos; it’s about unlocking new creative possibilities for everyone.
Q. What is Goku?
A. Goku is a family of joint image-and-video generation models leveraging rectified flow Transformers.
Q. What are Goku’s key components?
A. The key components are data curation, model architecture design, flow formulation, and training infrastructure optimization.
Q. On which benchmarks does Goku excel?
A. Goku excels on GenEval and DPG-Bench for text-to-image generation, and on VBench for text-to-video tasks.
Q. How large is Goku’s training dataset?
A. The training dataset comprises approximately 36M video-text pairs and 160M image-text pairs.
Q. What is rectified flow?
A. Rectified flow is a formulation used for joint image and video generation, implemented through the Goku model family.