
A new open-source AI model from Chinese startup Moonshot AI processes images, text, and videos with surprising efficiency. Kimi-VL stands out for its ability to handle long documents, complex reasoning, and user interface understanding.


According to Moonshot AI, Kimi-VL uses a mixture-of-experts architecture, activating only part of the model for each task. With just 2.8 billion active parameters—far fewer than many large models—Kimi-VL delivers results comparable to much bigger systems across various benchmarks.
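The idea behind sparse activation can be illustrated with a toy sketch. This is not Moonshot AI's implementation; the expert count, layer sizes, and top-2 routing here are arbitrary assumptions chosen only to show why the number of active parameters stays far below the total:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mixture-of-experts layer: 8 experts, but only the top 2 are
# activated per token, so most parameters stay idle on each pass.
n_experts, d_model, top_k = 8, 16, 2
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    logits = x @ router                   # score every expert for this token
    top = np.argsort(logits)[-top_k:]     # keep only the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over the chosen experts
    # Only the selected experts run; the rest contribute nothing.
    return sum(w * (x @ experts[i]) for i, w in zip(top, weights))

token = rng.standard_normal(d_model)
out = moe_forward(token)

# Parameter accounting: per token, only top_k expert matrices plus the
# router are used, while the full model holds all n_experts matrices.
active = top_k * d_model * d_model + d_model * n_experts
total = n_experts * d_model * d_model + d_model * n_experts
print(f"active params per token: {active} of {total}")
```

Scaled up, the same routing principle is what lets a model keep only a fraction of its weights (here, 2.8 billion parameters) active on any given input.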

The model boasts a maximum context window of 128,000 tokens, enough to process an entire book or lengthy video transcript. Moonshot AI reports that Kimi-VL consistently scores well on tests like LongVideoBench and MMLongBench-Doc.

Kimi-VL demonstrates strong visual analysis across a range of scenarios, from aerial comparisons of urban locations to game scenes. | Image: Moonshot AI

Kimi-VL's image processing capabilities are notable. Unlike some systems, it can analyze complete screenshots or complex graphics without breaking them into smaller pieces. The model also handles mathematical image problems and handwritten notes. In one test, it analyzed a handwritten manuscript, identified references to Albert Einstein, and explained their relevance.

The assistant guides users step by step, from initial setup to activating data protection features, such as Chrome's Do Not Track setting. | Image: Moonshot AI

The system also functions as a software assistant, interpreting graphical user interfaces and automating digital tasks. Moonshot AI claims that in tests where the model navigated browser menus or changed settings, it outperformed many other systems, including GPT-4o.

Compact design, competitive results

Compared to other open-source models like Qwen2.5-VL-7B and Gemma-3-12B-IT, Kimi-VL appears more efficient. According to Moonshot AI, it leads in 19 out of 24 benchmarks, despite running with far fewer active parameters. On MMBench-EN and AI2D, it reportedly matches or beats scores usually seen from larger, commercial models.

Kimi-VL-Thinking, with just 2.8 billion active parameters, outperforms larger models on the MathVision benchmark. | Image: Moonshot AI

The company attributes much of this performance to its training approach. Beyond standard supervised fine-tuning, Kimi-VL uses reinforcement learning. A specialized version called Kimi-VL-Thinking was trained to run through longer reasoning steps, boosting performance on tasks that require more complex thought, such as mathematical reasoning.

Kimi-VL-A3B achieves high scores on video and document processing tasks, compared here against models such as Qwen2.5-VL-7B. | Image: Moonshot AI

Kimi-VL isn't without constraints. Its compact size limits performance on highly language-intensive or niche tasks, and processing very long inputs remains technically challenging even within the expanded context window.

Moonshot AI says it plans to develop larger model versions, incorporate more training data, and improve fine-tuning. The company's stated long-term goal is to create a "powerful yet resource-efficient system" suitable for real-world use in research and industry.

Recommendation

Earlier this year, Moonshot AI released Kimi k1.5, a multimodal model for complex reasoning that the company claims holds its own against GPT-4o in benchmarks. Kimi k1.5 is available on the kimi.ai web interface. A demo of Kimi-VL can be found on Hugging Face.

Summary
  • Kimi-VL, an open-source AI model from Moonshot AI, processes images, text, and videos more efficiently than larger competitors, outperforming similar models in 19 out of 24 benchmarks with just 2.8 billion active parameters.
  • The model features a context window of 128,000 tokens, allowing it to handle entire books, long videos, high-resolution images without splitting, mathematical image tasks, and handwritten note recognition.
  • Kimi-VL uses a mixture-of-experts architecture and advanced training methods like supervised fine-tuning and reinforcement learning, making it especially effective as a software assistant for interpreting graphical user interfaces and automating digital tasks.
Jonathan writes for THE DECODER about how AI tools can make our work and creative lives better.