Thanks to AI, you can now Whisper to an LLaMA


Alexa, Siri, and similar voice assistants have long been touted as the next big thing in computer interfaces, but so far, they have not lived up to expectations. Large language models could change that.

To take Alexa and Co. to the next level, we need (at least) four advances:

reliable and flexible speech recognition,
authentic voice output,
flexible, consistent conversational interaction,
and task deduction and execution.

Items two through four could be addressed by the new generation of ChatGPT-like language models: they can sustain credible, continuous dialog, produce voice output via speech model APIs, and perform complex tasks via plugins or code. These models are far more competent and flexible than anything Alexa, Siri, and the like currently offer.

While tool-based language models are still in their infancy, AI models such as OpenAI's Whisper have made reliable speech recognition a practical reality.

Whisper meets LLaMA

Developer Georgi Gerganov's "LLaMA Voice Chat" offers a taste of a next-generation assistant based entirely on open-source technology. Gerganov has made the OpenAI speech recognition model "Whisper" executable in C/C++ on the Apple Neural Engine. The video below shows it in action on an iPhone 13.

Video: Gerganov

According to Gerganov, his Whisper implementation is efficient enough to run smoothly on multiple platforms: from iOS to Android to Raspberry Pi, and even in the browser using WebAssembly.

On its own, it is "just" high-quality speech transcription. But it becomes an interface when combined with other software, such as a large language model. For demonstration purposes, Gerganov uses Meta's LLaMA language model, which responds to the user's speech after Whisper has transcribed it into text.

In the prompt for LLaMA, Gerganov instructs the model to act as a "helpful, friendly, and honest" assistant who can write well and provide direct, detailed responses. The following video shows the Whisper-LLaMA combination in action.


Video: Gerganov
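To make the Whisper-LLaMA combination more concrete, here is a minimal Python sketch of one conversational turn: audio is transcribed, the text is placed in a prompt that starts with the assistant persona, and the language model's reply is returned. This is not Gerganov's code, which is written in C/C++. The names `build_prompt`, `voice_chat_turn`, `transcribe`, and `generate` are hypothetical, the "User:/Assistant:" layout is an assumed convention, and the persona text only paraphrases the description above. The two models are passed in as plain functions so the sketch runs without either model installed.

```python
from typing import Callable, List, Tuple

# Persona paraphrased from the article; the exact wording of the
# original prompt is not given there, so this text is an assumption.
SYSTEM_PROMPT = (
    "The assistant is helpful, friendly, and honest, writes well, "
    "and gives direct, detailed answers."
)

History = List[Tuple[str, str]]  # (user text, assistant reply) pairs


def build_prompt(history: History, user_text: str) -> str:
    """Assemble a plain-text chat prompt: persona, past turns, new user turn."""
    lines = [SYSTEM_PROMPT, ""]
    for user, assistant in history:
        lines.append(f"User: {user}")
        lines.append(f"Assistant: {assistant}")
    lines.append(f"User: {user_text}")
    lines.append("Assistant:")  # the language model continues from here
    return "\n".join(lines)


def voice_chat_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],  # stand-in for a speech-to-text model
    generate: Callable[[str], str],      # stand-in for a language model
    history: History,
) -> str:
    """Run one turn: speech -> text -> prompt -> reply, and record the turn."""
    user_text = transcribe(audio).strip()
    reply = generate(build_prompt(history, user_text)).strip()
    history.append((user_text, reply))
    return reply


if __name__ == "__main__":
    # Stub models for illustration only; real ones would wrap Whisper and LLaMA.
    def fake_transcribe(audio: bytes) -> str:
        return " What is a llama? "

    def fake_generate(prompt: str) -> str:
        return " A domesticated South American camelid."

    turns: History = []
    print(voice_chat_turn(b"", fake_transcribe, fake_generate, turns))
```

Keeping the two models behind simple function arguments means either one can be swapped out, for example for a different speech recognizer, without touching the conversation logic.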

To become a true assistant, LLaMA would still need reliable access to tools such as smart home interfaces or a browser agent. But with the advent of software like Auto-GPT, this should only be a matter of implementation. Last fall, a developer demonstrated an assistant that could generate images via Stable Diffusion using Whisper and GPT-3.
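What "access to tools" could look like in practice can be sketched in a few lines of Python. The sketch assumes a made-up convention in which the language model answers either with ordinary text or with a single line of the form `TOOL: <name> <JSON arguments>`. That convention, the `dispatch` function, and the `set_lights` tool are all hypothetical and are not taken from Auto-GPT or from Gerganov's demo.

```python
import json
from typing import Any, Callable, Dict

Tool = Callable[[Dict[str, Any]], str]


def set_lights(args: Dict[str, Any]) -> str:
    """Hypothetical smart home tool; a real one would call a device API."""
    return f"lights in {args['room']} turned {args['state']}"


TOOLS: Dict[str, Tool] = {"set_lights": set_lights}


def dispatch(model_output: str, tools: Dict[str, Tool]) -> str:
    """Run a tool if the model asked for one; otherwise pass the text through."""
    prefix = "TOOL:"
    text = model_output.strip()
    if not text.startswith(prefix):
        return text  # ordinary reply, nothing to execute
    # Split "name {json}" into the tool name and its raw JSON arguments.
    name, _, raw_args = text[len(prefix):].strip().partition(" ")
    if name not in tools:
        return f"unknown tool: {name}"
    try:
        args = json.loads(raw_args or "{}")
    except json.JSONDecodeError:
        return f"invalid arguments for {name}"
    return tools[name](args)


if __name__ == "__main__":
    print(dispatch('TOOL: set_lights {"room": "kitchen", "state": "on"}', TOOLS))
    print(dispatch("Llamas are native to South America.", TOOLS))
```

The hard part, and the reason the article calls this a question of reliability, is getting the model to emit well-formed tool calls consistently; the dispatcher itself is simple.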

Gerganov offers his Whisper version as well as the example LLaMA implementations on GitHub.
