
Moondream, a startup based in Seattle, has released moondream2, a compact vision language model that performs well in benchmarks despite its small size. The open-source model could pave the way for local image recognition on smartphones.


In March, the US startup Moondream released moondream2, a vision language model (VLM) that has attracted significant attention. The model can process both text and image inputs, enabling it to answer questions, extract text (OCR), count objects, and classify items. Since its release, regular updates have continued to improve its benchmark performance.

Screenshot of the Moondream model
The July version of the Moondream model shows improved OCR and document comprehension capabilities, demonstrated through historical economic data analysis. With DocVQA, TextVQA, and GQA scores exceeding 60%, the locally executable model shows significant progress.

What makes moondream2 remarkable is its compact size: with only 1.6 billion parameters, the model can run not just on cloud servers but also on local computers and even less powerful devices like smartphones or single-board computers.

Despite its small size, the model performs strongly, outperforming some competing models many times its size in certain benchmarks. In a comparison of VLMs on mobile devices, researchers highlighted moondream2's performance:


In particular, we note that although moondream2 has only about 1.7B parameters, its performance is quite comparable to the 7B models. It only falls behind on SQA, the only dataset that provides a related context besides the image and questions for the models to answer the questions effectively. This could indicate that even the strongest smaller models are not able to understand the context.

Murthy et al. in the paper "MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases"

According to developer Vikhyat Korrapati, the model builds on others like SigLIP, Microsoft's Phi-1.5, and the LLaVA training dataset.

moondream2 is developed as open source and is available for free download on GitHub or as a demo on Hugging Face. On the coding platform, it has generated significant interest in the developer community, earning more than 5,000 stars.

Millions invested in startup

This success has attracted investor attention: in a pre-seed funding round led by Felicis Ventures, Microsoft's M12 GitHub Fund, and Ascend, the company recently raised $4.5 million. CEO Jay Allen, who previously spent many years at Amazon Web Services, leads the growing startup.

Moondream2 joins a series of specialized, optimized open-source models that deliver performance similar to larger, older models while requiring fewer resources to run. Researchers recently presented GOT, a model trained specifically for OCR tasks, and the startup Useful Sensors released Moonshine, a promising solution for speech transcription.


On-device AI is particularly relevant for smartphones, and open-source advances like moondream2 prove its technical feasibility. For consumers, however, practical application remains limited: while small on-device models exist for Apple Intelligence and Google's Gemini Nano, both manufacturers still offload more complex tasks to the cloud.

Summary
  • Seattle-based startup Moondream has released moondream2, a compact vision language model (VLM) with just 1.6 billion parameters that, despite its small size, can compete with much larger models in benchmarks.
  • moondream2 accepts not only text but also images as input and can answer questions, extract text (OCR), and count or classify objects based on them. It is available as open source on GitHub and has caused quite a stir in the developer community.
  • Moondream recently raised $4.5 million in a pre-seed funding round. The model joins the ranks of specialized and optimized open-source models that deliver similar performance to larger models while using fewer resources, which is particularly relevant for use on smartphones.
Jonathan works as a freelance tech journalist for THE DECODER, focusing on AI tools and how GenAI can be used in everyday work.