
Researchers have developed a method for generating longer, more coherent AI videos that tell complex stories.


While AI video generation has improved significantly in recent months, length limitations have remained a persistent challenge. OpenAI's Sora maxes out at 20 seconds, Meta's MovieGen at 16 seconds, and Google's Veo 2 at just 8 seconds. Now, a team from Nvidia, Stanford University, UCSD, UC Berkeley, and UT Austin has introduced a solution: Test-Time Training layers (TTT layers) that enable videos up to one minute long.

The fundamental issue with existing models stems from the self-attention mechanism in their Transformer architectures. Self-attention requires each element in a sequence to relate to every other element, so computational requirements grow quadratically with sequence length. For minute-long videos containing over 300,000 tokens, this becomes computationally prohibitive.
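To get a sense of the scale involved, here is a back-of-the-envelope sketch in Python. The token count is the figure cited above; the memory estimate is our own illustration, not a number from the paper.

```python
# Rough sketch of why full self-attention over a minute-long video is
# prohibitive. Token count from the article; memory math is illustrative.
num_tokens = 300_000                     # tokens in a roughly one-minute video

# Self-attention relates every token to every other one, so the attention
# map alone holds num_tokens ** 2 scores.
attention_pairs = num_tokens ** 2
print(f"{attention_pairs:,} token pairs")          # 90,000,000,000

# Stored as 16-bit floats (2 bytes per score), a single attention map per
# head and layer would already need about:
print(f"{attention_pairs * 2 / 1e9:.0f} GB")       # ~180 GB
```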

Recurrent neural networks (RNNs) offer a potential alternative since they process data sequentially and store information in a "hidden state," with computational demands that scale linearly with sequence length. However, traditional RNNs struggle to capture complex relationships over extended sequences due to their architecture.
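A minimal vanilla RNN makes that trade-off concrete: one pass over the sequence, with cost growing linearly in its length, but everything the model remembers has to be squeezed into a single fixed-size hidden vector. This is a generic sketch, not the architecture used in the paper.

```python
import torch

def rnn_scan(tokens, W_in, W_h):
    """Compress a whole sequence into a single fixed-size hidden state."""
    hidden = torch.zeros(W_h.shape[0])
    for x in tokens:                      # one step per token: O(sequence length)
        hidden = torch.tanh(W_in @ x + W_h @ hidden)
    return hidden                         # same size no matter how long the input

dim, hidden_dim, seq_len = 64, 128, 1_000
tokens = torch.randn(seq_len, dim)
W_in = 0.01 * torch.randn(hidden_dim, dim)
W_h = 0.01 * torch.randn(hidden_dim, hidden_dim)
print(rnn_scan(tokens, W_in, W_h).shape)  # torch.Size([128]) -- fixed-size memory
```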


How TTT layers transform video generation

The researchers' innovation replaces simple hidden states in conventional RNNs with small neural networks that continuously learn during the video generation process. These TTT layers work alongside the attention mechanism.

During each processing step, the mini-network trains to recognize and reconstruct patterns in the current image section. This creates a more sophisticated memory system that maintains consistency across longer sequences, keeping rooms and characters stable throughout multiple scenes. A similar test-time training approach showed success on the ARC-AGI benchmark in late 2024, though that implementation relied on LoRAs.
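In simplified terms, the hidden state becomes a tiny inner model that takes a gradient step on a self-supervised reconstruction loss for every incoming token, and the layer's output is that freshly updated model applied to the token. The sketch below follows this general recipe; the single linear inner model, the identical "views", and the inner learning rate are placeholders for illustration, not the paper's actual design.

```python
import torch

class TTTState:
    """Toy test-time-training memory: the hidden state is a small model
    that keeps learning while the sequence is processed."""

    def __init__(self, dim, inner_lr=0.1):
        self.W = torch.zeros(dim, dim)    # inner model: a single linear map
        self.inner_lr = inner_lr

    def step(self, x_train, x_label, x_query):
        # Inner self-supervised loss: reconstruct one view of the token
        # from another, then take one gradient step on the inner weights.
        W = self.W.requires_grad_(True)
        loss = ((x_train @ W - x_label) ** 2).mean()
        (grad,) = torch.autograd.grad(loss, W)
        self.W = (W - self.inner_lr * grad).detach()
        # The layer output is the updated inner model applied to a query view.
        return x_query @ self.W

dim = 32
state = TTTState(dim)
for _ in range(10):                       # tokens arrive one after another
    tok = torch.randn(dim)
    out = state.step(tok, tok, tok)       # identical views, purely illustrative
print(out.shape)                          # torch.Size([32])
```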

Image: Dalal, Koceja, Hussein, Xu et al.

The team demonstrated their approach using Tom and Jerry cartoons. Their dataset comprises approximately seven hours of cartoon footage paired with detailed, human-written descriptions.

Image: Dalal, Koceja, Hussein, Xu et al.

Users can describe their video ideas with varying levels of specificity:

  1. A short summary in 5-8 sentences (e.g., "Tom happily eats an apple pie at the kitchen table. Jerry looks on longingly...")
  2. A more detailed plot of about 20 sentences, with each sentence corresponding to a 3-second segment
  3. A comprehensive storyboard where each 3-second segment is described by a paragraph of 3-5 sentences detailing background, characters, and camera movements

Extending video length by 20 times

The researchers built upon CogVideo-X, a pre-trained model with 5 billion parameters that originally generated only 3-second clips. By integrating TTT layers, they progressively trained it to handle longer durations - from 3 seconds to 9, 18, 30, and finally 63 seconds.
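The staged curriculum can be written down as a simple schedule. The durations below are the ones named above; splitting each stage into 3-second segments is our reading of the setup, and any further hyperparameters would be hypothetical.

```python
# Hypothetical sketch of the progressive length schedule; stage durations are
# from the article, the segment arithmetic is illustrative.
finetune_stages = [3, 9, 18, 30, 63]      # target clip length in seconds

for seconds in finetune_stages:
    segments = seconds // 3               # the base model works in 3-second chunks
    print(f"fine-tune on {seconds:2d}s clips ({segments} x 3-second segments)")
```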

The computationally expensive self-attention mechanisms only apply to 3-second segments, while the more efficient TTT layers operate globally across the entire video, keeping computational requirements manageable. Each video is generated by the model in a single pass, without subsequent editing or montage. The resulting videos tell coherent stories spanning multiple scenes.
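Structurally, that hybrid can be pictured as follows: full attention only inside each short segment, plus a recurrent state carried across segment boundaries so information can flow along the whole video. This is an illustration under our own assumptions, not the paper's exact block design.

```python
import torch
import torch.nn as nn

class LocalAttentionWithGlobalState(nn.Module):
    """Sketch: quadratic attention stays local to a segment, while a carried
    state (standing in for the TTT memory) links segments globally."""

    def __init__(self, dim, segment_len, num_heads=4):
        super().__init__()
        self.segment_len = segment_len
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mix = nn.Linear(2 * dim, dim)   # merges segment output with carried state

    def forward(self, x):                    # x: (batch, seq_len, dim)
        b, n, d = x.shape
        carried = x.new_zeros(b, 1, d)       # placeholder for the global memory
        outputs = []
        for start in range(0, n, self.segment_len):
            seg = x[:, start:start + self.segment_len]     # local window only
            seg_out, _ = self.attn(seg, seg, seg)          # quadratic, but short
            mixed = self.mix(torch.cat([seg_out, carried.expand_as(seg_out)], dim=-1))
            carried = mixed.mean(dim=1, keepdim=True)      # crude carry-over
            outputs.append(mixed)
        return torch.cat(outputs, dim=1)

block = LocalAttentionWithGlobalState(dim=64, segment_len=16)
video_tokens = torch.randn(2, 128, 64)       # 2 videos, 128 tokens each
print(block(video_tokens).shape)             # torch.Size([2, 128, 64])
```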

Despite these advances, the model still has limitations - objects sometimes change at segment transitions, float unnaturally, or experience abrupt lighting changes.

All information, examples, and comparisons with other methods are available on the project's GitHub page.

Summary
  • Researchers have developed a new method, called Test-Time Training layers (TTT layers), that significantly extends AI-generated videos - from a maximum of around 20 seconds for leading models to up to 63 seconds.
  • The approach combines Transformer models with recurrent neural networks: the TTT layers continue to learn while the video is generated, giving the model a better "memory" for longer sequences without the quadratic growth in compute that full self-attention would require.
  • As a proof of concept, the researchers extended the CogVideo-X model with TTT layers and trained it on Tom and Jerry cartoons, letting users enter their video ideas at three levels of detail - from short summaries to detailed storyboards.