Thanks to AI, you can now Whisper to an LLaMA

Apr 18, 2023

Midjourney prompted by THE DECODER

Alexa, Siri, and similar voice assistants have long been touted as the next big thing in computer interfaces, but so far, they have not lived up to expectations. Large language models could change that.

To take Alexa and Co. to the next level, we need (at least) four advances:

reliable and flexible speech recognition,
authentic voice output,
flexible, consistent conversational interaction, and
task deduction and execution.

Items two through four could be addressed by the new generation of ChatGPT-like language models: They can sustain credible, continuous dialog, produce voice output via speech model APIs, and perform complex tasks via plugins or code. Current language models are far more competent and flexible than anything Alexa, Siri, and the like offer today.

While tool-based language models are still in their infancy, AI models such as OpenAI's Whisper have made reliable speech recognition a practical reality.

Whisper meets LLaMA

Developer Georgi Gerganov's "LLaMA Voice Chat" offers a taste of a next-generation assistant based entirely on open-source technology. Gerganov has made the OpenAI speech recognition model "Whisper" executable in C/C++ on the Apple Neural Engine. The video below shows it in action on an iPhone 13.

Video: Gerganov

According to Gerganov, his Whisper implementation is efficient enough to run smoothly on multiple platforms: from iOS to Android to Raspberry Pi, and even in the browser using WebAssembly.

On its own, it is "just" high-quality speech transcription. But it becomes an interface when combined with other software, such as a large language model. For demonstration purposes, Gerganov uses Meta's LLaMA language model, which responds to the user's speech after Whisper has transcribed it to text.

As a prompt for LLaMA, Gerganov guides the model to be a "helpful, friendly, and honest" assistant who can write well and provide direct, detailed responses. The following video shows the Whisper-LLaMA combination in action.

Video: Gerganov
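
The flow described above (speech in, transcription, prompt, model response) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Gerganov's C/C++ code: the names `build_prompt` and `voice_chat_turn` are hypothetical, the system prompt only paraphrases the description in this article, and the speech recognizer and language model are passed in as plain functions so that any Whisper or LLaMA binding could be plugged in.

```python
from typing import Callable, List, Tuple

# Assumption: paraphrases the article's description of the prompt;
# this is NOT the actual prompt text used in the demo.
SYSTEM_PROMPT = (
    "You are a helpful, friendly, and honest assistant. "
    "You write well and give direct, detailed answers."
)

History = List[Tuple[str, str]]  # (user text, assistant reply) pairs


def build_prompt(history: History, user_text: str) -> str:
    """Combine system prompt, earlier turns, and the new user text."""
    lines = [SYSTEM_PROMPT, ""]
    for user, assistant in history:
        lines.append(f"User: {user}")
        lines.append(f"Assistant: {assistant}")
    lines.append(f"User: {user_text}")
    lines.append("Assistant:")
    return "\n".join(lines)


def voice_chat_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],  # hypothetical speech-to-text hook
    generate: Callable[[str], str],      # hypothetical language-model hook
    history: History,
) -> str:
    """Run one turn: audio -> text -> prompt -> reply."""
    user_text = transcribe(audio).strip()
    if not user_text:
        # Nothing was recognized, so the language model is not called.
        return ""
    reply = generate(build_prompt(history, user_text)).strip()
    history.append((user_text, reply))
    return reply
```

Keeping the two models behind simple function parameters mirrors the point made above: the transcription step is independent of whatever consumes its text, so the same loop could feed a different language model or, later, a tool-using agent.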

To become a true assistant, LLaMA would still need reliable access to tools such as smart home interfaces or a browser agent. But with the advent of software like Auto-GPT, this should only be a matter of implementation. Last fall, a developer demonstrated an assistant that could generate images via Stable Diffusion using Whisper and GPT-3.

Gerganov offers his Whisper version as well as the example LLaMA implementations on GitHub.
