Text-to-video systems transform written words into dynamic visuals. Now, Zeroscope is providing this technology as free software.
Zeroscope stems from Modelscope, a multi-stage text-to-video diffusion model with 1.7 billion parameters that generates video content from textual descriptions. Zeroscope refines this concept, offering higher resolution, no Shutterstock watermark, and an aspect ratio closer to 16:9.
Zeroscope comes in two components: Zeroscope_v2 576w, designed for rapid content creation at a resolution of 576x320 pixels to explore video concepts, and Zeroscope_v2 XL, which upscales promising videos to a "high-definition" resolution of 1024x576. The music in the following demo video was added in post-production.
For video generation, the model requires 7.9 GB of VRAM at a resolution of 576x320 pixels with a frame rate of 30 frames per second, and 15.3 GB of VRAM at a resolution of 1024x576 pixels at the same frame rate. The smaller model should therefore run on many consumer graphics cards.
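As a rough illustration, generation with the smaller model can be scripted through Hugging Face's diffusers library. The sketch below follows the pattern from the model card at the time of writing; exact argument names and return types vary between diffusers versions, so treat it as a starting point rather than a fixed recipe:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load the 576x320 base model in half precision to reduce VRAM use.
pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
)
# Offload idle submodules to the CPU so generation fits on consumer GPUs.
pipe.enable_model_cpu_offload()

prompt = "Darth Vader surfing a wave"
# 24 frames matches the clip length used during fine-tuning.
video_frames = pipe(
    prompt, num_inference_steps=40, height=320, width=576, num_frames=24
).frames
video_path = export_to_video(video_frames)  # writes an .mp4 and returns its path
```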
Zeroscope's training involved offset noise applied to 9,923 clips and 29,769 tagged frames, each clip comprising 24 frames. Rather than perturbing the content of the frames themselves, offset noise adds a small per-channel constant to the otherwise zero-mean Gaussian noise used during diffusion training, which lets the model learn global brightness and color shifts it would otherwise struggle to reproduce.
This broadens the model's understanding of the data distribution. As a result, it can generate a more diverse range of realistic videos and interpret variations in text descriptions more effectively.
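The technique itself amounts to only a couple of lines in a typical diffusion training loop. The sketch below is illustrative of the general method, not Zeroscope's actual training code, and the 0.1 scale is a commonly cited default rather than a value published for Zeroscope:

```python
import torch

def offset_noise(latents: torch.Tensor, strength: float = 0.1) -> torch.Tensor:
    """Gaussian noise plus a small per-channel constant offset.

    latents: video latents shaped (batch, channels, frames, height, width).
    The offset is broadcast across frames and pixels, shifting the mean of
    the noise so the model also learns global brightness/color changes.
    """
    noise = torch.randn_like(latents)
    offset = torch.randn(
        latents.shape[0], latents.shape[1], 1, 1, 1, device=latents.device
    )
    return noise + strength * offset

# In training, this replaces the plain `torch.randn_like(latents)` call
# before the scheduler adds noise to the latents at a sampled timestep.
```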
Could this be an open-source competitor to Runway?
According to Zeroscope developer "Cerspense", who has experience with Modelscope, fine-tuning a model with 24 GB of VRAM is not "super hard". He removed the Modelscope watermarks during the fine-tuning process.
He describes his model as "designed to take on Gen-2," the commercial text-to-video model offered by Runway ML. According to Cerspense, Zeroscope is completely free for public use.
AI artist and developer "dotsimulate" shows more examples of videos generated with Zeroscope XL in the video below.
Both the 576w model and Zeroscope v2 XL can be downloaded for free from Hugging Face, which also offers instructions on how to use them. A version of Zeroscope that runs in Colab, including a tutorial, is available here.
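Upscaling with the XL model follows a video-to-video pattern in diffusers. Again, this is a hedged sketch based on the Hugging Face documentation; the strength value is a commonly suggested starting point, not a fixed requirement:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
from PIL import Image

# Load the XL upscaler; it refines an existing clip rather than starting from text alone.
pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_XL", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

# `prompt` and `video_frames` are the inputs and 576x320 outputs of the
# base model from the earlier sketch; resize the frames to the XL resolution.
video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]

# Lower strength preserves more of the input clip; higher values re-imagine it.
upscaled = pipe(prompt, video=video, strength=0.6).frames
video_path = export_to_video(upscaled)
```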
Could text-to-video technology evolve as rapidly as text-to-image?
Text-to-video is still in its infancy. AI-generated clips are typically only a few seconds long and riddled with visual flaws. Image AI models initially faced similar issues yet achieved photorealism within months. Unlike image generation, however, video generation is far more resource-intensive, both in training and in inference.
Google has already unveiled Phenaki and Imagen Video, two text-to-video models capable of generating high-resolution, longer, logically coherent clips, though neither has been released. Meta's Make-a-Video, another text-to-video model, also remains unreleased.
Currently, Runway's Gen-2 is the only commercially available text-to-video model, and it has now arrived on the iPhone. Zeroscope marks the advent of the first high-quality open-source model.