Generating One-Minute Videos with Test-Time Training

Nitika Sharma | Last Updated: 10 Apr, 2025
6 min read

Video generation from text has come a long way, but it still hits a wall when it comes to producing longer, multi-scene stories. While diffusion models like Sora, Veo, and Movie Gen have raised the bar in visual quality, they’re typically limited to clips under 20 seconds. The real challenge? Context. Generating a one-minute, story-driven video from a paragraph of text requires models to process hundreds of thousands of tokens while maintaining narrative and visual coherence. That’s where this new research from NVIDIA, Stanford, UC Berkeley, and others steps in, introducing a technique called Test-Time Training (TTT) to push past current limitations.

What’s the Problem with Long Videos?

Transformers, particularly those used in video generation, rely on self-attention mechanisms. These scale poorly with sequence length due to their quadratic computational cost. Attempting to generate a full minute of high-resolution video with dynamic scenes and consistent characters means juggling over 300,000 tokens of information. That makes the model inefficient and often incoherent over long stretches.
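To make the scaling concrete, here is a back-of-the-envelope sketch. The 300,000-token figure comes from the article; the per-segment token count is an illustrative assumption.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs that full self-attention must score: O(n^2)."""
    return num_tokens * num_tokens

# Token counts are illustrative; the article cites 300,000+ tokens for one minute.
short_clip = attention_pairs(15_000)    # ~2.3e8 pairs for a ~3-second segment (assumed)
one_minute = attention_pairs(300_000)   # ~9.0e10 pairs for a full minute

print(one_minute / short_clip)          # 400.0: 20x the tokens, 400x the work
```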

Some teams have tried to circumvent this by using recurrent neural networks (RNNs) such as Mamba or DeltaNet, which offer linear-time context handling. However, these models compress the entire context into a fixed-size hidden state, which limits expressiveness. It's like trying to squeeze an entire movie onto a postcard: some details just won't fit.
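For intuition, here is a generic linear-attention-style recurrence (a sketch only, not the exact Mamba or DeltaNet update; the dimension is illustrative). However long the sequence gets, everything the model remembers must fit into one fixed-size matrix.

```python
import torch

d = 64                        # feature dimension (illustrative)
S = torch.zeros(d, d)         # the entire memory: fixed size, regardless of sequence length

def step(S, k, v, q):
    """One recurrent step: write the token into S, then read with the query."""
    S = S + torch.outer(v, k)     # compress the new token into the matrix state
    out = S @ q                   # retrieve an answer from the compressed memory
    return S, out

for _ in range(10_000):           # the sequence keeps growing...
    k, v, q = torch.randn(3, d)
    S, out = step(S, k, v, q)     # ...but S stays (d, d), so details get overwritten
```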

How Does TTT (Test-Time Training) Solve the Issue?

This paper builds on the idea of making the hidden state of an RNN more expressive by turning it into a trainable neural network itself. Specifically, the authors propose TTT layers: essentially small, two-layer MLPs that adapt on the fly while processing the input sequence. These layers are updated at inference time using a self-supervised loss, which lets them learn dynamically from the video's evolving context.

Imagine a model that adapts mid-flight: as the video unfolds, its internal memory adjusts to better understand the characters, motions, and storyline. That’s what TTT enables.
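Here is a minimal sketch of that inner loop, assuming the recipe the TTT line of work describes: the "memory" is the weight set of a small two-layer MLP, nudged by one gradient step on a self-supervised reconstruction loss per token. The projections, layer sizes, and inner learning rate are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TTTLayerSketch(nn.Module):
    """Hidden state = the weights of a tiny 2-layer MLP, updated at test time."""

    def __init__(self, d: int, hidden: int = 256, inner_lr: float = 1e-2):
        super().__init__()
        self.q_proj = nn.Linear(d, d)   # "query" view used to read the memory
        self.k_proj = nn.Linear(d, d)   # "key" view: input of the inner loss
        self.v_proj = nn.Linear(d, d)   # "value" view: target of the inner loss
        self.f = nn.Sequential(         # the hidden state itself: a 2-layer MLP
            nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d)
        )
        self.inner_lr = inner_lr

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (seq_len, d)
        outputs = []
        for x in tokens:
            k, v, q = self.k_proj(x), self.v_proj(x), self.q_proj(x)
            # Self-supervised inner step: teach f to map the key view to the value view.
            loss = F.mse_loss(self.f(k), v)
            grads = torch.autograd.grad(loss, list(self.f.parameters()))
            with torch.no_grad():
                for p, g in zip(self.f.parameters(), grads):
                    p -= self.inner_lr * g        # the memory just "learned" this token
            outputs.append(self.f(q))             # read out with the updated memory
        return torch.stack(outputs)

layer = TTTLayerSketch(d=64)
out = layer(torch.randn(120, 64))                 # the MLP adapts as the sequence unfolds
```

The key point is that the update happens inside the forward pass: no labels and no outer-loop backprop, just the model teaching its own memory as the video streams by.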

Examples of One-Minute Videos with Test-Time Training

Adding TTT Layers to a Pre-Trained Transformer

Adding TTT layers to a pre-trained Transformer enables it to generate one-minute videos with strong temporal consistency and motion smoothness.

Prompt: Jerry snatches a wedge of cheese and races for his mousehole with Tom in pursuit. He slips inside just in time, leaving Tom to crash into the wall. Safe and cozy, Jerry enjoys his prize at a tiny table, happily nibbling as the scene fades to black.

Baseline Comparisons

TTT-MLP outperforms all other baselines in temporal consistency, motion smoothness, and overall aesthetics, as measured by human evaluation Elo scores.

Prompt: Tom is happily eating an apple pie at the kitchen table. Jerry looks longingly wishing he had some. Jerry goes outside the front door of the house and rings the doorbell. While Tom comes to open the door, Jerry runs around the back to the kitchen. Jerry steals Tom’s apple pie. Jerry runs to his mousehole carrying the pie, while Tom is chasing him. Just as Tom is about to catch Jerry, he makes it through the mouse hole and Tom slams into the wall.

Limitations

The generated one-minute videos demonstrate clear potential as a proof of concept, but still contain notable artifacts.

How Does It Work?

The system starts with a pre-trained Diffusion Transformer, CogVideoX 5B, which could previously generate only 3-second clips. The researchers inserted TTT layers into the model and trained them (along with local attention blocks) to handle longer sequences.

To manage cost, self-attention was restricted to short, 3-second segments, while the TTT layers took charge of understanding the global narrative across these segments. The architecture also includes gating mechanisms to ensure TTT layers don’t degrade performance during early training.
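One plausible shape for that gating is sketched below (an assumption consistent with the description above, not the paper's exact code): the TTT branch enters through a learned gate initialized at zero, so the block initially behaves like the original pre-trained Transformer and the long-range contribution is blended in gradually.

```python
import torch
import torch.nn as nn

class GatedTTTBlock(nn.Module):
    """Local attention within a ~3-second segment plus a gated, global TTT branch (sketch)."""

    def __init__(self, d: int, local_attn: nn.Module, ttt_layer: nn.Module):
        super().__init__()
        self.local_attn = local_attn              # pre-trained attention, segment-local
        self.ttt = ttt_layer                      # TTT layer carrying the global storyline
        self.gate = nn.Parameter(torch.zeros(d))  # zero init: no TTT contribution at first

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.local_attn(x)                    # short-range detail within a segment
        x = x + torch.tanh(self.gate) * self.ttt(x)   # long-range memory, gated in gradually
        return x
```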

They further enhanced training by processing sequences bidirectionally and segmenting videos into annotated scenes. For example, a storyboard format was used to describe each 3-second segment in detail: backgrounds, character positions, camera angles, and actions.
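A hypothetical storyboard entry for one 3-second segment might look like this; the field names and values are invented for illustration, not taken from the released dataset.

```python
segment_annotation = {
    "segment_index": 7,
    "duration_seconds": 3,
    "background": "Kitchen interior, afternoon light through the window",
    "characters": {
        "Tom": "standing by the table, leaning over the apple pie",
        "Jerry": "peeking out from behind the sugar jar",
    },
    "camera": "static wide shot of the kitchen table",
    "action": "Jerry darts across the table and grabs the edge of the pie dish",
}
```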

The Dataset: Tom & Jerry with a Twist

To ground the research in a consistent, well-understood visual domain, the team curated a dataset from over 7 hours of classic Tom and Jerry cartoons. These were broken down into scenes and finely annotated into 3-second segments. By focusing on cartoon data, the researchers avoided the complexity of photorealism and homed in on narrative coherence and motion dynamics.

Human annotators wrote descriptive paragraphs for each segment, ensuring the model had rich, structured input to learn from. This also allowed for multi-stage training—first on 3-second clips, and progressively on longer sequences up to 63 seconds.
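The resulting curriculum can be pictured roughly as below. Only the 3-second start and 63-second endpoint are stated here; the intermediate lengths are assumptions for illustration, not the paper's exact schedule.

```python
# Illustrative curriculum: fine-tune on progressively longer clips.
training_stages_seconds = [3, 9, 18, 30, 63]   # only 3 and 63 are stated; the rest are assumed

for seconds in training_stages_seconds:
    # At each stage the dataset is re-chunked into clips of this length, and the
    # TTT layers (plus gates) are fine-tuned before moving on to the next stage.
    print(f"Fine-tuning stage: {seconds}-second clips")
```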

Performance: Does it Actually Work?

Yes, and impressively so. When benchmarked against leading baselines like Mamba 2, Gated DeltaNet, and sliding-window attention, the TTT-MLP model outperformed them by an average of 34 Elo points in a human evaluation across 100 videos.
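For readers unfamiliar with Elo, the margin comes from pairwise human preferences aggregated with the standard Elo update (the K-factor and starting ratings below are illustrative, not the study's exact protocol):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one pairwise human comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two methods start at 1000; repeated wins for one push its rating up.
r_ttt, r_baseline = 1000.0, 1000.0
for ttt_preferred in [True, True, False, True]:
    r_ttt, r_baseline = elo_update(r_ttt, r_baseline, ttt_preferred)
```

Under this model, a 34-point lead means the higher-rated method would be preferred in roughly 55% of head-to-head comparisons.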

The evaluation considered:

  • Text alignment: How well the video follows the prompt
  • Motion naturalness: Realism in character movement
  • Aesthetics: Lighting, color, and visual appeal
  • Temporal consistency: Visual coherence across scenes

TTT-MLP particularly excelled in motion and scene consistency, maintaining logical continuity across dynamic actions—something that other models struggled with.

Artifacts & Limitations

Despite the promising results, there are still artifacts. Lighting may shift inconsistently, or motion may look floaty (e.g., cheese hovering unnaturally). These issues are likely linked to the limitations of the base model, CogVideo-X. Another bottleneck is efficiency. While TTT-MLP is significantly faster than full self-attention models (2.5x speedup), it’s still slower than leaner RNN approaches like Gated DeltaNet. That said, TTT only needs to be fine-tuned—not trained from scratch—making it more practical for many use cases.

What Makes This Approach Stand Out

  • Expressive Memory: TTT turns the hidden state of RNNs into a trainable network, making it far more expressive than a fixed-size matrix.
  • Adaptability: TTT layers learn and adjust during inference, allowing them to respond in real time to the unfolding video.
  • Scalability: With enough resources, this method scales to longer and more complex video stories.
  • Practical Fine-Tuning: Researchers fine-tune only the TTT layers and gates, which keeps training lightweight and efficient.

Future Directions

The team points out several opportunities for expansion:

  • Optimizing the TTT kernel for faster inference
  • Experimenting with larger or different backbone models
  • Exploring even more complex storylines and domains
  • Using Transformer-based hidden states instead of MLPs for even more expressiveness

TTT Video Generation vs MoCha vs Goku vs OmniHuman1 vs DreamActor-M1

The table below explains the differences between this model and other trending video generation models:

| Model | Core Focus | Input Type | Key Features | How It Differs from TTT |
|-------|------------|------------|--------------|-------------------------|
| TTT (Test-Time Training) | Long-form video generation with dynamic adaptation | Text storyboard | Adapts during inference; handles 60+ sec videos; coherent multi-scene stories | Designed for long videos; updates internal state during generation for narrative consistency |
| MoCha | Talking character generation | Text + Speech | No keypoints or reference images; speech-driven full-body animation | Focuses on character dialogue & expressions, not full-scene narrative videos |
| Goku | High-quality video & image generation | Text, Image | Rectified Flow Transformers; multi-modal input support | Optimized for quality & training speed; not designed for long-form storytelling |
| OmniHuman1 | Realistic human animation | Image + Audio + Text | Multiple conditioning signals; high-res avatars | Creates lifelike humans; doesn’t model long sequences or dynamic scene transitions |
| DreamActor-M1 | Image-to-animation (face/body) | Image + Driving Video | Holistic motion imitation; high frame consistency | Animates static images; doesn’t use text or handle scene-by-scene story generation |

End Note

Test-Time Training offers a fascinating new lens for tackling long-context video generation. By letting the model learn and adapt during inference, it bridges a crucial gap in storytelling, a domain where continuity, emotion, and pacing matter just as much as visual fidelity.

Whether you’re a researcher in generative AI, a creative technologist, or a product leader curious about what’s next for AI-generated media, this work is a signpost pointing toward the future of dynamic, coherent video synthesis from text.
