
HPC-AI Tech has developed a new video AI system that achieves commercial-grade quality at about one-tenth the typical training cost by using new compression methods.


While language models have become increasingly efficient, video AI still requires substantial GPU resources. Open-Sora 2.0 takes a different approach by trading some resolution for dramatically lower computing needs.

Prompt: "Two women sit on a beige couch in a cozy, warmly lit room with a brick wall backdrop. They engage in a cheerful conversation, smiling and toasting red wine in an intimate medium shot." | Video: HPC-AI Tech

Prompt: "A group of anthropomorphic mushrooms having a disco party in the middle of a dark enchanted forest, with glowing neon lights and exaggerated dance moves, their smooth textures and reflective surfaces emphasizing a comical 3D look." | Video: HPC-AI Tech


Prompt: "A tomato surfing on a piece of lettuce down a waterfall of ranch dressing, with exaggerated surfing moves and creamy wave effects to highlight the 3D animated fun." | Video: HPC-AI Tech

The research paper reveals training costs of approximately $200,000, roughly one-tenth of what systems like Movie Gen or Step-Video-T2V require. Testing indicates quality comparable to commercial systems like Runway Gen-3 Alpha and HunyuanVideo. The team used 224 Nvidia H200 GPUs for training.

Table: Comparison of model, number of GPUs, GPU hours, and cost for a single training run for the video generation models Movie Gen, Step-Video-T2V, and Open-Sora 2.0.
Training cost comparison: Open-Sora 2.0 requires approximately $200,000, compared to $2.5 million for Movie Gen and $1 million for Step-Video-T2V. | Image: HPC-AI Tech

The system achieves its efficiency through three training phases: beginning with low-resolution videos, specializing in image-to-video conversion, and finally fine-tuning for higher resolution. The team further optimized resources by incorporating pre-trained image models like Flux.
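
To make the staged approach concrete, here is a minimal Python sketch of what such a training schedule could look like. The stage names, resolutions, and step budgets are illustrative assumptions, not values from the Open-Sora 2.0 paper or codebase:

    from dataclasses import dataclass

    @dataclass
    class Stage:
        name: str        # label for the training phase
        task: str        # objective trained in this phase
        resolution: int  # spatial resolution of training clips (pixels)
        steps: int       # placeholder step budget (assumed, not from the paper)

    # Three phases, cheap to expensive; most compute is spent at low resolution.
    SCHEDULE = [
        Stage("low_res_pretrain", "text_to_video", 256, 100_000),
        # Image-to-video specialization can reuse a pre-trained image model
        # (e.g. Flux) as a visual prior instead of learning appearance from scratch.
        Stage("image_to_video_specialize", "image_to_video", 256, 50_000),
        # A short, expensive fine-tune at the target resolution comes last.
        Stage("high_res_finetune", "image_to_video", 768, 10_000),
    ]

    if __name__ == "__main__":
        for s in SCHEDULE:
            print(f"{s.name}: {s.task} at {s.resolution}px, {s.steps:,} steps")

The key design idea is that the expensive high-resolution work happens only briefly at the end, after motion and appearance have largely been learned at low cost.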

Central to the system is the Video DC-AE autoencoder, which delivers superior compression rates compared to existing methods. This innovation makes training 5.2 times faster while improving video generation speed by more than tenfold.
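
A rough back-of-the-envelope sketch shows why higher compression pays off: the diffusion backbone operates on latent tokens, so shrinking the latent grid shrinks the work per step. The compression factors below are generic assumptions for illustration, not the actual Video DC-AE configuration:

    def latent_tokens(frames, height, width, t_factor, s_factor):
        """Latent positions left after temporal and spatial downsampling."""
        return (frames // t_factor) * (height // s_factor) * (width // s_factor)

    # Open-Sora 2.0's maximum clip: 128 frames at 768x768 pixels.
    clip = dict(frames=128, height=768, width=768)

    moderate = latent_tokens(**clip, t_factor=4, s_factor=8)    # typical video VAE
    aggressive = latent_tokens(**clip, t_factor=4, s_factor=32) # deeper compression

    print(f"moderate compression:   {moderate:,} tokens")    # 294,912
    print(f"aggressive compression: {aggressive:,} tokens")  # 18,432
    print(f"reduction: {moderate // aggressive}x")           # 16x

Because attention cost grows quadratically with token count, even a modest extra compression factor can translate into the order-of-magnitude generation speedup the team reports, at the price of some fine detail.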

Example frames from videos generated with two different autoencoder compression rates: low compression in the upper row, high compression in the lower row.
Higher compression creates slightly less detailed output but enables substantially faster video generation speeds. | Image: HPC-AI Tech

Open-source system challenges commercial video AI

Open-Sora 2.0 can generate videos from both text descriptions and single images. It includes a motion score feature that lets users control movement intensity in the generated clips.
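
As a usage illustration, a motion-score knob might be exposed at inference time roughly like this. The function and parameter names are hypothetical stand-ins, not the actual Open-Sora 2.0 API:

    def generate_video(prompt, motion_score=4, first_frame=None):
        """Stub: a real pipeline would condition the diffusion model on the
        motion score alongside the text prompt (and an optional first frame
        for image-to-video generation)."""
        return {
            "text": prompt,
            "motion_score": motion_score,  # higher -> more camera/scene movement
            "first_frame": first_frame,    # set this for image-to-video mode
        }

    # Same prompt, two levels of scene dynamics:
    calm = generate_video("a sailboat drifting at sunset", motion_score=1)
    lively = generate_video("a sailboat drifting at sunset", motion_score=7)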

Image sequence showing the effect of the motion score on video generation: higher values increase camera movement and scene dynamics.
Higher motion scores result in more dynamic camera movements and increased scene activity. | Image: HPC-AI Tech

The system has notable limitations. Videos can only reach 768x768 pixels in resolution and run for five seconds maximum (128 frames). For comparison, OpenAI's Sora (which shares only its name with this project) can generate 1080p videos lasting up to 20 seconds.

Testing shows the system performing at near-commercial levels across key metrics including visual quality, prompt accuracy, and motion handling. Most notably, Open-Sora 2.0's VBench score now sits just 0.69 percent behind OpenAI's Sora, substantially closing the 4.52 percent gap seen in the previous version.

Bar chart comparing VBench scores for text-to-video models: Total Score, Quality Score and Semantic Score.
Open-Sora 2.0 demonstrates substantial improvements over its previous version while closing the quality gap with commercial video AI systems. | Image: HPC-AI Tech

Open-Sora is now available as open source on GitHub. Like other AI video models, it still faces challenges with occasional artifacts and physics-defying movements. You can watch more examples on the official project page.

AI video generation has become an increasingly competitive field, with Chinese companies leading much of the development. New systems launch almost weekly, including open-source projects like Genmo Mochi 1 and MiniMax Video-01. While these models often show modest benchmark improvements, none has achieved a major breakthrough in overall video quality.


The cost-efficiency strategies of Open-Sora 2.0 echo aspects of the "Deepseek moment" in language models, when improved training methods helped open-source systems achieve commercial-level performance at reduced costs. This could affect pricing throughout the video AI sector, where services like Google's latest model currently cost $0.50 per second of generated video due to intensive computing needs.

However, the performance gap between open-source and commercial video AI remains more significant than in language models, as even industry leaders continue working to solve fundamental technical challenges.

Summary
  • HPC-AI Tech has developed Open-Sora 2.0, an open-source AI video generation system that delivers quality comparable to commercial offerings at just one-tenth of the cost, with training expenses estimated at around $200,000.
  • The significant cost reduction is achieved through a combination of a three-stage training process, the use of pre-trained image models, and the introduction of a new video autoencoder.
  • In benchmarks, Open-Sora 2.0 performs only 0.69 percent behind OpenAI's Sora, a substantial improvement from its previous version, and its open-source nature could potentially drive down prices in the AI video generation market, mirroring the trend observed with language models.