Text-to-video systems transform written words into dynamic visuals. Now, Zeroscope is providing this technology as free software.

Zeroscope stems from Modelscope, a multilevel text-to-video diffusion model with 1.7 billion parameters that generates video content from textual descriptions. Zeroscope refines this concept, offering higher resolution, no Shutterstock watermark, and an aspect ratio closer to 16:9.

Zeroscope consists of two components. Zeroscope_v2 576w is designed for rapid content creation at a resolution of 576x320 pixels, which makes it suitable for exploring video concepts. Quality videos can then be upscaled to a "high definition" resolution of 1024x576 using zeroscope_v2 XL. The music in the following demo video was added in post-production.

Video: Zeroscope XL


For video generation, the model requires 7.9 GB of VRAM when rendering 30 frames at a resolution of 576x320 pixels, and 15.3 GB of VRAM when rendering 30 frames at 1024x576 pixels. The smaller model should therefore run on many standard graphics cards.
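The memory figures above can be turned into a quick check of which component fits a given graphics card. The following is a minimal sketch: the dictionary keys and the function name `models_that_fit` are illustrative, and the numbers are simply the ones quoted in this article, not measured values.

```python
# VRAM figures quoted in the article, in GB. Keys are illustrative labels.
VRAM_REQUIRED_GB = {
    "zeroscope_v2_576w": 7.9,   # 576x320 generation
    "zeroscope_v2_XL": 15.3,    # 1024x576 upscaling
}


def models_that_fit(available_vram_gb):
    """Return the components whose quoted VRAM need fits the given budget."""
    return sorted(
        name
        for name, needed_gb in VRAM_REQUIRED_GB.items()
        if needed_gb <= available_vram_gb
    )


if __name__ == "__main__":
    for budget_gb in (6.0, 12.0, 24.0):
        print(budget_gb, models_that_fit(budget_gb))
```

Actual memory use will vary with the number of frames, precision, and any offloading options, so the result is only a rough guide.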

Zeroscope was trained with offset noise on 9,923 clips and 29,769 tagged frames, with each clip comprising 24 frames. In diffusion training, offset noise usually means adding a small random constant to the sampled noise in each channel, rather than shifting objects within frames or changing frame timings.

Introducing this noise during training improves the model's grasp of the data distribution. As a result, the model can generate a more diverse range of realistic videos and interpret variations in text descriptions more effectively.
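As the term is commonly used in diffusion fine-tuning, offset noise means adding one random constant per channel to the Gaussian noise that the model learns to remove. The sketch below illustrates that common formulation in plain Python. It is an assumption about the general technique, not a reproduction of Cerspense's training code, and the function names and the strength value of 0.1 are illustrative.

```python
import random


def gaussian_noise(channels, frames, height, width, rng):
    """Standard Gaussian noise shaped [channel][frame][row][column]."""
    return [
        [
            [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
            for _ in range(frames)
        ]
        for _ in range(channels)
    ]


def add_offset(noise, strength, rng):
    """Shift every value in a channel by the same random constant.

    One constant is drawn per channel, so the whole clip becomes uniformly
    brighter or darker in that channel instead of changing pixel by pixel.
    """
    shifted = []
    for channel in noise:
        offset = strength * rng.gauss(0.0, 1.0)
        shifted.append(
            [[[value + offset for value in row] for row in frame] for frame in channel]
        )
    return shifted


if __name__ == "__main__":
    base = gaussian_noise(channels=3, frames=2, height=4, width=4, rng=random.Random(0))
    noisy = add_offset(base, strength=0.1, rng=random.Random(1))
    print(noisy[0][0][0][0] - base[0][0][0][0])  # the offset applied to channel 0
```

In a real training loop the same idea would typically be expressed as a tensor operation on the noise batch.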

Could this be open-source competition for Runway?

According to Zeroscope developer "Cerspense", who has experience with Modelscope, it's not "super hard" to fine-tune a model with 24 GB of VRAM. He removed the Modelscope watermarks during the fine-tuning process.

He describes his model as "designed to take on Gen-2," the commercial text-to-video model offered by Runway ML. According to Cerspense, Zeroscope is completely free for public use.


AI artist and developer "dotsimulate" shows more examples of ZeroscopeXL-generated videos in the video below.

Both Zeroscope_v2 576w and zeroscope_v2 XL can be downloaded for free from Hugging Face, which also offers instructions on how to use them. A Colab version of Zeroscope, including a tutorial, is also available.
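For readers who want to try the smaller model locally, the following is a minimal sketch using the Hugging Face `diffusers` library. The repository name `cerspense/zeroscope_v2_576w`, the helper names, and the step count are assumptions made for illustration; the instructions on the model's Hugging Face page take precedence. Running `generate_clip` requires `torch`, `diffusers`, and `accelerate`, plus a suitable GPU.

```python
MODEL_ID = "cerspense/zeroscope_v2_576w"  # assumed Hugging Face repository name


def generation_args(prompt, num_frames=24):
    """Collect call arguments matching the 576x320 model described above."""
    return {
        "prompt": prompt,
        "height": 320,
        "width": 576,
        "num_frames": num_frames,
        "num_inference_steps": 40,  # illustrative value, not from the article
    }


def generate_clip(prompt):
    """Generate a short clip and return the path of the exported video file."""
    # Imported lazily so generation_args() works without a GPU stack installed.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
    pipe.enable_model_cpu_offload()  # trades speed for lower VRAM use
    result = pipe(**generation_args(prompt))
    # Recent diffusers releases return a batch of videos; older releases
    # returned the frame list directly, in which case drop the [0].
    frames = result.frames[0]
    return export_to_video(frames)


if __name__ == "__main__":
    print(generate_clip("a sailboat crossing a calm bay at sunset"))
```

The weights are downloaded on the first call, so the initial run takes considerably longer than later ones.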

Could text-to-video technology evolve as rapidly as text-to-image?

Text-to-video is still in its infancy. AI-generated clips are typically only a few seconds long and have many visual flaws. Image AI models initially faced similar issues, however, and achieved photorealism within months. Unlike image generation, video generation is far more resource-intensive, both for training and generation.

Google has already unveiled Phenaki and Imagen Video, two text-to-video models capable of generating high-resolution, lengthier, logically coherent clips, though they are not yet released. Meta's Make-a-Video, a text-to-video model, also remains unreleased.

Currently, Runway's Gen-2 is the only commercially available text-to-video model, and it is now also available on the iPhone. Zeroscope is the first high-quality open-source model.

Summary
  • Zeroscope, a free and open-source program, uses text-to-video technology to transform written descriptions into high-quality videos. It builds on Modelscope, offering improved resolution, no watermarks, and an aspect ratio closer to 16:9 than the base model.
  • The software comprises two components: Zeroscope_v2 576w for rapid content creation at a lower resolution and zeroscope_v2 XL for upscaling content to a high-definition resolution.
  • Zeroscope is a potential open-source competitor to commercial models like Runway's Gen-2. It represents the beginning of high-quality, open-source text-to-video models, a technology still in its early stages but with the potential for rapid evolution similar to text-to-image models.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.