U.S. startup Useful Sensors has developed Moonshine, an open-source speech recognition model that processes audio more efficiently than OpenAI's Whisper while using fewer computing resources.

The company says it designed Moonshine specifically for real-time applications on hardware with limited resources. Moonshine's main advantage is its flexible architecture. While Whisper processes all audio in fixed 30-second segments regardless of length, Moonshine adjusts its processing time based on actual audio duration, making it particularly efficient for shorter clips.
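The difference in processing cost can be sketched with a toy calculation. The frame counts below assume Whisper's standard 10 ms hop over a fixed 30-second window; the proportional scaling for Moonshine is illustrative only, since its actual audio frontend differs in detail.

```python
def whisper_frames(duration_s: float, hop_s: float = 0.01, window_s: float = 30.0) -> int:
    """Whisper zero-pads every clip to a fixed 30 s window,
    so the encoder always processes the same number of frames."""
    return int(window_s / hop_s)

def moonshine_frames(duration_s: float, hop_s: float = 0.01) -> int:
    """Illustrative: an encoder whose input length scales with the
    actual audio duration, so short clips cost proportionally less."""
    return int(duration_s / hop_s)

for dur in (2.0, 10.0, 30.0):
    print(f"{dur:>5.1f} s audio -> fixed-window: {whisper_frames(dur)} frames, "
          f"proportional: {moonshine_frames(dur)} frames")
```

For a 2-second clip, the fixed-window approach still pays for the full 30 seconds of frames, while proportional processing handles roughly a fifteenth of that.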

Comparison of three padding strategies and their effect on speech recognition accuracy: zero-padding (5.21% WER), prefix truncation (107.38% WER), and suffix truncation (18.45% WER). The word error rate (WER) data reveals significant performance differences between these methods. | Image: Useful Sensors

The model comes in two sizes. The smaller Tiny version features 27.1 million parameters, while the larger Base version uses 61.5 million parameters. For comparison, OpenAI's equivalent models are larger: Whisper tiny.en uses 37.8 million parameters, and base.en 72.6 million parameters.

Testing shows the Tiny model matches its Whisper counterpart's accuracy while consuming less computing power. Both Moonshine versions maintained lower word error rates than Whisper during tests, even with varying audio levels and background noise.
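Word error rate, the metric used throughout these comparisons, is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: minimum number of word substitutions,
    insertions, and deletions, divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25
```

Because insertions count against the score, WER can exceed 100 percent, which is how a figure like the 107.38% reported for prefix truncation arises.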

Two tables: word error rate (WER) comparison between Moonshine and Whisper, in Base and Tiny variants, across eight different language datasets. Overall, Moonshine slightly outperforms Whisper on speech recognition benchmarks while also being more efficient. | Image: Useful Sensors

The researchers identified one area for improvement: very short audio clips under one second, which made up a small portion of the training data. Adding more short segments to the training set could improve the model's performance with these clips.

Offline capabilities open new doors

By operating efficiently without an internet connection, Moonshine enables applications that weren't feasible before due to hardware constraints. While Whisper runs on standard computers, it demands too much power for smartphones and small devices like Raspberry Pi computers. Useful Sensors uses Moonshine for Torre, its English-Spanish translator.

The code for Moonshine is available on GitHub. Users should note that AI transcription systems, like LLMs, can hallucinate. Researchers at Cornell University found that Whisper produced non-existent content about 1.4 percent of the time, with higher error rates for people with speech disorders such as aphasia. Other researchers and developers report much higher hallucination rates.

Summary
  • US startup Useful Sensors has developed Moonshine, an open-source speech recognition model optimized for real-time applications on resource-constrained hardware, achieving up to five times faster performance than OpenAI's Whisper.
  • Moonshine scales processing time proportionally to audio input length, eliminates overhead from zero-padding shorter data, and maintains Whisper-like accuracy at a reduced computational cost, despite having a smaller model size.
  • Moonshine models show strong performance in benchmarks, slightly outperforming Whisper in word error rate, but with room for improvement on very short audio segments. It's available as open source.
Jonathan works as a freelance tech journalist for THE DECODER, focusing on AI tools and how GenAI can be used in everyday work.