Generating One-Minute Videos with Test-Time Training

Nitika Sharma | Last Updated: 10 Apr, 2025
6 min read

Video generation from text has come a long way, but it still hits a wall when it comes to producing longer, multi-scene stories. While diffusion models like Sora, Veo, and Movie Gen have raised the bar in visual quality, they’re typically limited to clips under 20 seconds. The real challenge? Context. Generating a one-minute, story-driven video from a paragraph of text requires models to process hundreds of thousands of tokens while maintaining narrative and visual coherence. That’s where this new research from NVIDIA, Stanford, UC Berkeley, and others steps in, introducing a technique called Test-Time Training (TTT) to push past current limitations.

What’s the Problem with Long Videos?

Transformers, particularly those used in video generation, rely on self-attention mechanisms. These scale poorly with sequence length due to their quadratic computational cost. Attempting to generate a full minute of high-resolution video with dynamic scenes and consistent characters means juggling over 300,000 tokens of information. That makes the model inefficient and often incoherent over long stretches.
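To make the scaling concrete, here is a back-of-the-envelope sketch. The 300,000-token figure comes from the article; the per-segment token count is an illustrative assumption.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs that full self-attention must score: O(n^2)."""
    return num_tokens * num_tokens

# Token counts are illustrative; the article cites 300,000+ tokens for one minute.
short_clip = attention_pairs(15_000)    # ~2.3e8 pairs for a ~3-second segment (assumed)
one_minute = attention_pairs(300_000)   # ~9.0e10 pairs for a full minute

print(one_minute / short_clip)          # 400.0: 20x the tokens, 400x the work
```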

Some teams have tried to circumvent this by using recurrent neural networks (RNNs) such as Mamba or DeltaNet, which offer linear-time context handling. However, these models compress the entire context into a fixed-size hidden state, which limits expressiveness. It's like trying to squeeze an entire movie onto a postcard: some details just won't fit.
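For intuition, here is a generic linear-attention-style recurrence (a sketch only, not the exact Mamba or DeltaNet update; the dimension is illustrative). However long the sequence gets, everything the model remembers must fit into one fixed-size matrix.

```python
import torch

d = 64                        # feature dimension (illustrative)
S = torch.zeros(d, d)         # the entire memory: fixed size, regardless of sequence length

def step(S, k, v, q):
    """One recurrent step: write the token into S, then read with the query."""
    S = S + torch.outer(v, k)     # compress the new token into the matrix state
    out = S @ q                   # retrieve an answer from the compressed memory
    return S, out

for _ in range(10_000):           # the sequence keeps growing...
    k, v, q = torch.randn(3, d)
    S, out = step(S, k, v, q)     # ...but S stays (d, d), so details get overwritten
```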

How Does TTT (Test-Time Training) Solve the Issue?

This paper builds on the idea of making the hidden state of an RNN more expressive by turning it into a trainable neural network itself. Specifically, the authors propose TTT layers: essentially small, two-layer MLPs that adapt on the fly while processing the input sequence. These layers are updated at inference time using a self-supervised loss, which lets them learn dynamically from the video's evolving context.

Imagine a model that adapts mid-flight: as the video unfolds, its internal memory adjusts to better understand the characters, motions, and storyline. That’s what TTT enables.
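Here is a minimal sketch of that inner loop, assuming the recipe the TTT line of work describes: the "memory" is the weight set of a small two-layer MLP, nudged by one gradient step on a self-supervised reconstruction loss per token. The projections, layer sizes, and inner learning rate are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TTTLayerSketch(nn.Module):
    """Hidden state = the weights of a tiny 2-layer MLP, updated at test time."""

    def __init__(self, d: int, hidden: int = 256, inner_lr: float = 1e-2):
        super().__init__()
        self.q_proj = nn.Linear(d, d)   # "query" view used to read the memory
        self.k_proj = nn.Linear(d, d)   # "key" view: input of the inner loss
        self.v_proj = nn.Linear(d, d)   # "value" view: target of the inner loss
        self.f = nn.Sequential(         # the hidden state itself: a 2-layer MLP
            nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d)
        )
        self.inner_lr = inner_lr

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (seq_len, d)
        outputs = []
        for x in tokens:
            k, v, q = self.k_proj(x), self.v_proj(x), self.q_proj(x)
            # Self-supervised inner step: teach f to map the key view to the value view.
            loss = F.mse_loss(self.f(k), v)
            grads = torch.autograd.grad(loss, list(self.f.parameters()))
            with torch.no_grad():
                for p, g in zip(self.f.parameters(), grads):
                    p -= self.inner_lr * g        # the memory just "learned" this token
            outputs.append(self.f(q))             # read out with the updated memory
        return torch.stack(outputs)

layer = TTTLayerSketch(d=64)
out = layer(torch.randn(120, 64))                 # the MLP adapts as the sequence unfolds
```

The key point is that the update happens inside the forward pass: no labels and no outer-loop backprop, just the model teaching its own memory as the video streams by.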

Examples of One-Minute Videos with Test-Time Training

Adding TTT Layers to a Pre-Trained Transformer

Adding TTT layers to a pre-trained Transformer enables it to generate one-minute videos with strong temporal consistency and motion smoothness.

Prompt: Jerry snatches a wedge of cheese and races for his mousehole with Tom in pursuit. He slips inside just in time, leaving Tom to crash into the wall. Safe and cozy, Jerry enjoys his prize at a tiny table, happily nibbling as the scene fades to black.

Baseline Comparisons

TTT-MLP outperforms all other baselines in temporal consistency, motion smoothness, and overall aesthetics, as measured by human evaluation Elo scores.

Prompt: Tom is happily eating an apple pie at the kitchen table. Jerry looks longingly wishing he had some. Jerry goes outside the front door of the house and rings the doorbell. While Tom comes to open the door, Jerry runs around the back to the kitchen. Jerry steals Tom’s apple pie. Jerry runs to his mousehole carrying the pie, while Tom is chasing him. Just as Tom is about to catch Jerry, he makes it through the mouse hole and Tom slams into the wall.

Limitations

The generated one-minute videos demonstrate clear potential as a proof of concept, but still contain notable artifacts.

How Does It Work?

The system starts with a pre-trained Diffusion Transformer, CogVideoX 5B, which could previously generate only 3-second clips. The researchers inserted TTT layers into the model and trained them (along with local attention blocks) to handle longer sequences.

To manage cost, self-attention was restricted to short, 3-second segments, while the TTT layers took charge of understanding the global narrative across these segments. The architecture also includes gating mechanisms to ensure TTT layers don’t degrade performance during early training.
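One plausible shape for that gating is sketched below (an assumption consistent with the description above, not the paper's exact code): the TTT branch enters through a learned gate initialized at zero, so the block initially behaves like the original pre-trained Transformer and the long-range contribution is blended in gradually.

```python
import torch
import torch.nn as nn

class GatedTTTBlock(nn.Module):
    """Local attention within a ~3-second segment plus a gated, global TTT branch (sketch)."""

    def __init__(self, d: int, local_attn: nn.Module, ttt_layer: nn.Module):
        super().__init__()
        self.local_attn = local_attn              # pre-trained attention, segment-local
        self.ttt = ttt_layer                      # TTT layer carrying the global storyline
        self.gate = nn.Parameter(torch.zeros(d))  # zero init: no TTT contribution at first

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.local_attn(x)                    # short-range detail within a segment
        x = x + torch.tanh(self.gate) * self.ttt(x)   # long-range memory, gated in gradually
        return x
```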

They further enhanced training by processing sequences bidirectionally and segmenting videos into annotated scenes. For example, a storyboard format was used to describe each 3-second segment in detail: backgrounds, character positions, camera angles, and actions.
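A hypothetical storyboard entry for one 3-second segment might look like this; the field names and values are invented for illustration, not taken from the released dataset.

```python
segment_annotation = {
    "segment_index": 7,
    "duration_seconds": 3,
    "background": "Kitchen interior, afternoon light through the window",
    "characters": {
        "Tom": "standing by the table, leaning over the apple pie",
        "Jerry": "peeking out from behind the sugar jar",
    },
    "camera": "static wide shot of the kitchen table",
    "action": "Jerry darts across the table and grabs the edge of the pie dish",
}
```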

The Dataset: Tom & Jerry with a Twist

To ground the research in a consistent, well-understood visual domain, the team curated a dataset from over 7 hours of classic Tom and Jerry cartoons. These were broken down into scenes and finely annotated into 3-second segments. By focusing on cartoon data, the researchers avoided the complexity of photorealism and homed in on narrative coherence and motion dynamics.

Human annotators wrote descriptive paragraphs for each segment, ensuring the model had rich, structured input to learn from. This also allowed for multi-stage training—first on 3-second clips, and progressively on longer sequences up to 63 seconds.
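The resulting curriculum can be pictured roughly as below. Only the 3-second start and 63-second endpoint are stated here; the intermediate lengths are assumptions for illustration, not the paper's exact schedule.

```python
# Illustrative curriculum: fine-tune on progressively longer clips.
training_stages_seconds = [3, 9, 18, 30, 63]   # only 3 and 63 are stated; the rest are assumed

for seconds in training_stages_seconds:
    # At each stage the dataset is re-chunked into clips of this length, and the
    # TTT layers (plus gates) are fine-tuned before moving on to the next stage.
    print(f"Fine-tuning stage: {seconds}-second clips")
```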

Performance: Does it Actually Work?

Yes, and impressively so. When benchmarked against leading baselines like Mamba 2, Gated DeltaNet, and sliding-window attention, the TTT-MLP model outperformed them by an average of 34 Elo points in a human evaluation across 100 videos.
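For readers unfamiliar with Elo, the margin comes from pairwise human preferences aggregated with the standard Elo update (the K-factor and starting ratings below are illustrative, not the study's exact protocol):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one pairwise human comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two methods start at 1000; repeated wins for one push its rating up.
r_ttt, r_baseline = 1000.0, 1000.0
for ttt_preferred in [True, True, False, True]:
    r_ttt, r_baseline = elo_update(r_ttt, r_baseline, ttt_preferred)
```

Under this model, a 34-point lead means the higher-rated method would be preferred in roughly 55% of head-to-head comparisons.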

The evaluation considered:

  • Text alignment: How well the video follows the prompt
  • Motion naturalness: Realism in character movement
  • Aesthetics: Lighting, color, and visual appeal
  • Temporal consistency: Visual coherence across scenes

TTT-MLP particularly excelled in motion and scene consistency, maintaining logical continuity across dynamic actions—something that other models struggled with.

Artifacts & Limitations

Despite the promising results, there are still artifacts. Lighting may shift inconsistently, or motion may look floaty (e.g., cheese hovering unnaturally). These issues are likely linked to the limitations of the base model, CogVideo-X. Another bottleneck is efficiency. While TTT-MLP is significantly faster than full self-attention models (2.5x speedup), it’s still slower than leaner RNN approaches like Gated DeltaNet. That said, TTT only needs to be fine-tuned—not trained from scratch—making it more practical for many use cases.

What Makes This Approach Stand Out

  • Expressive Memory: TTT turns the hidden state of RNNs into a trainable network, making it far more expressive than a fixed-size matrix.
  • Adaptability: TTT layers learn and adjust during inference, allowing them to respond in real time to the unfolding video.
  • Scalability: With enough resources, this method scales to longer and more complex video stories.
  • Practical Fine-Tuning: Researchers fine-tune only the TTT layers and gates, which keeps training lightweight and efficient.

Future Directions

The team points out several opportunities for expansion:

  • Optimizing the TTT kernel for faster inference
  • Experimenting with larger or different backbone models
  • Exploring even more complex storylines and domains
  • Using Transformer-based hidden states instead of MLPs for even more expressiveness

TTT Video Generation vs MoCha vs Goku vs OmniHuman1 vs DreamActor-M1

The table below explains the differences between this model and other trending video generation models:

| Model | Core Focus | Input Type | Key Features | How It Differs from TTT |
|-------|------------|------------|--------------|-------------------------|
| TTT (Test-Time Training) | Long-form video generation with dynamic adaptation | Text storyboard | Adapts during inference; handles 60+ sec videos; coherent multi-scene stories | Designed for long videos; updates internal state during generation for narrative consistency |
| MoCha | Talking character generation | Text + Speech | No keypoints or reference images; speech-driven full-body animation | Focuses on character dialogue & expressions, not full-scene narrative videos |
| Goku | High-quality video & image generation | Text, Image | Rectified Flow Transformers; multi-modal input support | Optimized for quality & training speed; not designed for long-form storytelling |
| OmniHuman1 | Realistic human animation | Image + Audio + Text | Multiple conditioning signals; high-res avatars | Creates lifelike humans; doesn’t model long sequences or dynamic scene transitions |
| DreamActor-M1 | Image-to-animation (face/body) | Image + Driving Video | Holistic motion imitation; high frame consistency | Animates static images; doesn’t use text or handle scene-by-scene story generation |

End Note

Test-Time Training offers a fascinating new lens for tackling long-context video generation. By letting the model learn and adapt during inference, it bridges a crucial gap in storytelling, a domain where continuity, emotion, and pacing matter just as much as visual fidelity.

Whether you’re a researcher in generative AI, a creative technologist, or a product leader curious about what’s next for AI-generated media, this work is a signpost pointing toward the future of dynamic, coherent video synthesis from text.
