
Alexa, Siri, and similar voice assistants have long been touted as the next big thing in computer interfaces, but so far, they have not lived up to expectations. Large language models could change that.

To take Alexa and Co. to the next level, we need (at least) four advances:

  • reliable and flexible speech recognition,
  • authentic voice output,
  • flexible, consistent conversational interaction, and
  • task deduction and execution.

Items two through four could be addressed by the new generation of ChatGPT-like language models: they can sustain credible, continuous dialog, produce voice output via speech-model APIs, and perform complex tasks via plugins or code. Current language models are far more competent and flexible than anything Alexa, Siri, and the like offer today.

While tool-based language models are still in their infancy, AI models such as OpenAI's Whisper have made reliable speech recognition a practical reality.
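
For a sense of how little code reliable transcription now takes, here is a minimal example using OpenAI's reference Whisper package in Python. Note that this is the official open-source release, not Gerganov's C/C++ port discussed below; the model size and file name are placeholders.

```python
# Minimal speech-to-text with OpenAI's reference Whisper package
# (pip install openai-whisper); "base" and "recording.wav" are placeholders.
import whisper

model = whisper.load_model("base")          # small multilingual checkpoint
result = model.transcribe("recording.wav")  # any local audio file
print(result["text"])                       # the recognized text
```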


Whisper meets LLaMA

Developer Georgi Gerganov's "LLaMA Voice Chat" offers a taste of a next-generation assistant based entirely on open-source technology. Gerganov has ported OpenAI's speech recognition model Whisper to C/C++, where it can run on the Apple Neural Engine. The video below shows it in action on an iPhone 13.

Video: Gerganov

According to Gerganov, his Whisper port is efficient enough to run smoothly on multiple platforms: from iOS and Android to the Raspberry Pi, and even in the browser via WebAssembly.

On its own, it is "just" high-quality speech transcription. But it becomes an interface when combined with other software, such as a large language model. For demonstration purposes, Gerganov uses Meta's LLaMA language model, which generates responses to the user's speech as transcribed by Whisper.
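
Conceptually, the pipeline is straightforward: transcribe the audio, hand the text to the language model, and read back its completion. The sketch below illustrates this in Python using the llama-cpp-python bindings to Gerganov's llama.cpp; the model file name, prompt format, and sampling parameters are illustrative assumptions, not details from his demo.

```python
# Sketch of a Whisper-to-LLaMA pipeline: speech in, model-generated text out.
# Assumes llama-cpp-python (pip install llama-cpp-python) and a local model
# file; the path and parameters below are placeholders.
import whisper
from llama_cpp import Llama

stt = whisper.load_model("base")
llm = Llama(model_path="llama-7b.gguf")  # placeholder path to a local model

user_text = stt.transcribe("question.wav")["text"]
reply = llm(f"Q: {user_text}\nA:", max_tokens=128, stop=["Q:"])
print(reply["choices"][0]["text"].strip())
```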

In his prompt, Gerganov instructs LLaMA to act as a "helpful, friendly, and honest" assistant that writes well and gives direct, detailed answers. The following video shows the Whisper-LLaMA combination in action.


Video: Gerganov
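
Gerganov has not published the exact wording shown in the video, but a persona prompt along these lines, wrapped in a simple turn-by-turn loop, would produce similar behavior. This is a speculative reconstruction that reuses the stt and llm objects from the sketch above; audio capture is omitted.

```python
# Hypothetical reconstruction of the assistant persona and dialog loop;
# the prompt wording is an assumption, not Gerganov's original.
SYSTEM_PROMPT = (
    "A chat between a human and an assistant. The assistant is helpful, "
    "friendly, and honest, writes well, and gives direct, detailed answers.\n"
)

history = SYSTEM_PROMPT
for turn in range(3):  # a few demo turns; real code would record continuously
    user_text = stt.transcribe("turn.wav")["text"]  # fresh recording each turn
    history += f"Human: {user_text}\nAssistant:"
    reply = llm(history, max_tokens=256, stop=["Human:"])["choices"][0]["text"]
    history += reply + "\n"
    print(reply.strip())
```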

To become a true assistant, LLaMA would still need reliable access to tools such as smart home interfaces or a browser agent. But with the advent of software like Auto-GPT, this should be just a matter of implementation. Last fall, a developer demonstrated an assistant that used Whisper and GPT-3 to generate images via Stable Diffusion.
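
A common pattern for such tool access is to have the model emit a structured command that local code then executes. The sketch below is purely illustrative: the JSON convention and the set_light function are invented for this example and are not taken from Gerganov's demo or Auto-GPT.

```python
# Hypothetical tool dispatch: the model answers with JSON, local code executes
# it. The command format and set_light function are invented for illustration.
import json

def set_light(room: str, on: bool) -> str:
    # Stand-in for a real smart home API call.
    return f"Light in {room} switched {'on' if on else 'off'}."

TOOLS = {"set_light": set_light}

prompt = (
    'Respond only with JSON like '
    '{"tool": "set_light", "args": {"room": "kitchen", "on": true}}.\n'
    "User: turn on the kitchen light\nJSON:"
)
call = json.loads(llm(prompt, max_tokens=64, stop=["\n"])["choices"][0]["text"])
print(TOOLS[call["tool"]](**call["args"]))
```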

Gerganov offers his Whisper port as well as example LLaMA implementations on GitHub.

Summary
  • Large language models enable realistic conversations with computers.
  • Developer Georgi Gerganov demonstrates a lightweight Whisper implementation combined with the LLaMA language model.
  • Combined with more credible speech synthesis and tool access for language models, Alexa and Co. could see a resurgence.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.