Zeroscope is a free text-to-video model that runs on modern graphics cards

Text-to-video systems transform written words into dynamic visuals. Now, Zeroscope is providing this technology as free software.

Zeroscope stems from Modelscope (demo), a multilevel text-to-video diffusion model with 1.7 billion parameters that generates video content from textual descriptions. Zeroscope refines this concept: it offers higher resolution, removes the Shutterstock watermark, and comes closer to a 16:9 aspect ratio.

Zeroscope consists of two components. Zeroscope_v2 576w is designed for rapid content creation at a resolution of 576x320 pixels, so that video concepts can be explored quickly. Promising videos can then be upscaled to a "high definition" resolution of 1024x576 using zeroscope_v2 XL. The music in the following demo video was added in post-production.

Video: Zeroscope XL

For video generation, the smaller model requires 7.9 GB of VRAM when rendering 30 frames at a resolution of 576x320 pixels, and the XL model requires 15.3 GB of VRAM when rendering 30 frames at 1024x576 pixels. The smaller model should therefore run on many standard graphics cards.
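To make the hardware reasoning concrete, here is a minimal sketch that checks which of the two resolutions fits on a given card. The VRAM figures and resolutions are the ones quoted above. The tier labels, the `tiers_that_fit` helper and the `headroom_gb` margin are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelTier:
    name: str        # illustrative label, not an official model identifier
    width: int
    height: int
    vram_gb: float   # VRAM figure quoted in the article

    @property
    def aspect_ratio(self) -> float:
        return self.width / self.height


# Figures taken from the article text.
TIERS = [
    ModelTier("base (576x320)", 576, 320, 7.9),
    ModelTier("XL (1024x576)", 1024, 576, 15.3),
]


def tiers_that_fit(card_vram_gb: float, headroom_gb: float = 0.0) -> List[str]:
    """Return the tiers whose quoted VRAM requirement fits on the card.

    headroom_gb is an assumed safety margin for memory already used by
    the desktop or other processes; it is not a figure from the article.
    """
    usable = card_vram_gb - headroom_gb
    return [tier.name for tier in TIERS if tier.vram_gb <= usable]


if __name__ == "__main__":
    for vram in (6.0, 8.0, 12.0, 24.0):
        print(vram, "GB ->", tiers_that_fit(vram))
    for tier in TIERS:
        print(tier.name, "aspect ratio:", round(tier.aspect_ratio, 3))
```

By these numbers, a card with 8 GB only just covers the smaller model, so real-world use would probably need some headroom. The aspect-ratio property also shows that 1024x576 is exactly 16:9, while 576x320 works out to 1.8.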

Zeroscope was trained with offset noise on 9,923 clips and 29,769 tagged frames, with each training clip comprising 24 frames. Offset noise, as the technique is usually described for diffusion models, adds a small random offset shared across a whole channel to the per-pixel Gaussian noise used during training.

This noise introduction during training enhances the model's understanding of the data distribution. As a result, the model can generate a more diverse range of realistic videos, and more effectively interpret variations in text descriptions.
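The offset-noise idea can be sketched in a few lines. The version below follows the way the technique is usually described for diffusion models: a random constant, shared by every pixel of a channel, is added to the ordinary per-pixel Gaussian noise. It is an illustration on plain Python lists, not Zeroscope's actual training code. The function names and the strength value of 0.1 are assumptions, and a video model may share the offset across frames differently.

```python
import random
from typing import List

Channel = List[List[float]]  # one channel as rows of pixel values


def gaussian_noise(channels: int, height: int, width: int,
                   rng: random.Random) -> List[Channel]:
    """Standard diffusion noise: an independent N(0, 1) draw per pixel."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(width)]
             for _ in range(height)]
            for _ in range(channels)]


def add_offset_noise(noise: List[Channel], strength: float,
                     rng: random.Random) -> List[Channel]:
    """Add one random constant per channel on top of the per-pixel noise.

    strength scales the shared offset; 0.1 is used below purely as an
    illustrative assumption.
    """
    result = []
    for channel in noise:
        offset = strength * rng.gauss(0.0, 1.0)  # shared by every pixel
        result.append([[value + offset for value in row] for row in channel])
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    base = gaussian_noise(channels=2, height=3, width=4, rng=rng)
    shifted = add_offset_noise(base, strength=0.1, rng=rng)
    for c in range(2):
        # Every pixel in a channel moves by the same amount.
        print("channel", c, "offset:", shifted[c][0][0] - base[c][0][0])
```

Because the offset is constant within a channel, it changes the overall level of the noise rather than its fine detail. This is why the technique is usually associated with a wider range of overall brightness in generated images.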

Could this be an open-source competition for Runway?

According to Zeroscope developer "Cerspense", who has experience with Modelscope, it's not "super hard" to fine-tune a model with 24 GB of VRAM. He removed the Modelscope watermarks during the fine-tuning process.

He describes his model as "designed to take on Gen-2," the commercial text-to-video model offered by Runway ML. According to Cerspense, Zeroscope is completely free for public use.

AI artist and developer "dotsimulate" shows more examples of ZeroscopeXL-generated videos in the video below.

Both Zeroscope v2 576w and Zeroscope v2 XL can be downloaded for free from Hugging Face, which also offers instructions on how to use them. A Colab version of Zeroscope, including a tutorial, is also available.

Could text-to-video technology evolve as rapidly as text-to-image?

Text-to-video is still in its infancy. AI-generated clips are typically only a few seconds long and have many visual flaws. Image AI models initially faced similar issues, however, and achieved photorealism within months. Unlike image generation, though, video generation is far more resource-intensive, both for training and generation.

Google has already unveiled Phenaki and Imagen Video, two text-to-video models capable of generating high-resolution, lengthier, logically coherent clips, though they are not yet released. Meta's Make-a-Video, a text-to-video model, also remains unreleased.

Currently, Runway's Gen-2 is the only commercially available text-to-video model, and it can now also be used on the iPhone. Zeroscope is the first high-quality open-source model.
