Chinese AI company Kling AI has introduced "Video O1," billed as the "world's first unified multimodal video model": a system designed to handle both video generation and editing tasks within a single framework.
According to Kling AI, Video O1 integrates several tasks that previously required separate tools. The model can generate videos of three to ten seconds from prompts or reference images, and it can also edit existing footage, for example swapping protagonists, changing the weather, or adjusting styles and colors. Video O1 handles such requests in a single prompt, letting users add a subject, modify the background, and change the visual style simultaneously.
Processing multiple inputs simultaneously
The model processes different types of input at the same time, accepting up to seven references (images, videos, or subjects) alongside text as prompts. Users can edit videos with text commands like "remove passersby" or "change daylight to twilight" without manual masking or keyframes.
Users can upload characters, props, or scenes, which the system then uses in different contexts. Actions or camera movements can also serve as references. Kling says the system understands the input data well enough to keep subjects, people, or products consistent across different shots.
Video O1 relies on a multimodal transformer architecture, though the company hasn't shared many details. Kling introduced a "Multimodal Visual Language" (MVL) to act as an interactive bridge between text and multimodal signals. The model uses reasoning chains to deduce events, enabling intelligent video generation that moves beyond simple pattern reconstruction, echoing the kind of language Google used to describe its own recent advancements with Nano Banana Pro.
Internal tests show performance gains over competitors
Kling AI tested Video O1 internally against Google Veo 3.1 and Runway Aleph. In tasks involving video creation from image references, Video O1 reportedly performed far better than Google's "ingredients to video" feature. For video transformations, meaning edits to existing videos, evaluators reportedly preferred O1 over Runway Aleph 2.3 times as often. However, these figures come from Kling AI's own internal tests and haven't been verified externally.

Video O1 is available now via Kling's web interface. While the Chinese company may have taken a step forward with O1, the market remains highly competitive. At almost the same time, Runway unveiled Gen-4.5, its most powerful video model to date. Alongside Western companies like Google, OpenAI, and Midjourney, Kling competes with Chinese rivals such as Hailuo, Seedance, and Vidu, which focus primarily on cost efficiency.