Alexa, Siri, and similar voice assistants have long been touted as the next big thing in computer interfaces, but so far, they have not lived up to expectations. Large language models could change that.
To take Alexa and Co. to the next level, we need (at least) four advances:
- reliable and flexible speech recognition,
- authentic voice output,
- flexible, consistent conversational interaction, and
- task deduction and execution.
Items two through four could be addressed by the new generation of ChatGPT-like language models: they can sustain credible, continuous dialog, produce voice output via speech model APIs, and perform complex tasks via plugins or code. These models are far more competent and flexible than anything Alexa, Siri, and the like currently offer.
While tool-based language models are still in their infancy, AI models such as OpenAI's Whisper have made reliable speech recognition a practical reality.
Whisper meets LLaMA
Developer Georgi Gerganov's "LLaMA Voice Chat" offers a taste of a next-generation assistant based entirely on open-source technology. Gerganov has ported OpenAI's speech recognition model "Whisper" to C/C++ and made it run on the Apple Neural Engine. The video below shows it in action on an iPhone 13.
According to Gerganov, his Whisper port is efficient enough to run smoothly on multiple platforms: from iOS to Android to Raspberry Pi, and even in the browser using WebAssembly.
On its own, this is "just" high-quality speech transcription. But it becomes an interface when combined with other software, such as a large language model. For demonstration purposes, Gerganov uses Meta's LLaMA language model, which responds to the user's speech after Whisper has transcribed it to text.
In the prompt, Gerganov instructs LLaMA to act as a "helpful, friendly, and honest" assistant who can write well and provide direct, detailed responses. The following video shows the Whisper-LLaMA combination in action.
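The chain described above (speech in, transcript handed to the language model, reply out) can be sketched in a few lines of Python. This is a minimal illustration and not Gerganov's code: the names `build_prompt` and `voice_turn`, the exact prompt wording, and the stand-in `fake_transcribe` and `fake_generate` functions are all assumptions. In a real setup, those two stand-ins would be replaced by calls into a Whisper and a LLaMA runtime.

```python
from typing import Callable, List, Tuple

# Assumed wording: the article only says the assistant should be
# "helpful, friendly, and honest" and answer directly and in detail.
SYSTEM_PROMPT = (
    "The following is a conversation with a helpful, friendly, and honest "
    "assistant that writes well and gives direct, detailed answers."
)

History = List[Tuple[str, str]]  # (speaker, text) pairs


def build_prompt(history: History, user_text: str) -> str:
    """Combine the instruction, earlier turns, and the new transcript."""
    lines = [SYSTEM_PROMPT, ""]
    for speaker, text in history:
        lines.append(f"{speaker}: {text}")
    lines.append(f"User: {user_text}")
    lines.append("Assistant:")
    return "\n".join(lines)


def voice_turn(
    audio: bytes,
    history: History,
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
) -> str:
    """One round trip: audio -> transcript -> prompt -> model reply."""
    user_text = transcribe(audio).strip()
    reply = generate(build_prompt(history, user_text)).strip()
    history.append(("User", user_text))
    history.append(("Assistant", reply))
    return reply


# Hypothetical stand-ins so the sketch runs without any model installed.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(prompt: str) -> str:
    user_lines = [l for l in prompt.splitlines() if l.startswith("User: ")]
    return "You said: " + user_lines[-1][len("User: "):]
```

Passing the transcriber and the language model in as plain callables keeps the loop independent of any particular speech or language model, which mirrors how the demo combines two separate components.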
To become a true assistant, LLaMA would still need reliable access to tools such as smart home interfaces or a browser agent. But with the advent of software like Auto-GPT, this should only be a matter of implementation. Last fall, a developer demonstrated an assistant that could generate images via Stable Diffusion using Whisper and GPT-3.
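At the implementation level, "access to tools" could be as simple as routing certain model replies to functions. The sketch below assumes a made-up convention in which the model answers with a line such as `TOOL lights on` when it wants to act. Neither LLaMA nor Auto-GPT prescribes this format, and the tool names `lights` and `browser` as well as the function `handle_reply` are hypothetical placeholders.

```python
from typing import Callable, Dict


# Hypothetical tools; real ones would call a smart home API or a browser agent.
def set_lights(arg: str) -> str:
    return f"lights turned {arg}"


def open_page(arg: str) -> str:
    return f"opened {arg}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "lights": set_lights,
    "browser": open_page,
}


def handle_reply(reply: str) -> str:
    """Run a tool if the reply looks like 'TOOL <name> <argument>'.

    Any other reply is returned unchanged so it can be spoken to the user.
    """
    parts = reply.strip().split(maxsplit=2)
    if len(parts) >= 2 and parts[0] == "TOOL":
        name = parts[1]
        arg = parts[2] if len(parts) == 3 else ""
        tool = TOOLS.get(name)
        if tool is None:
            return f"unknown tool: {name}"
        return tool(arg)
    return reply
```

The hard part in practice is not this dispatch step but getting the model to emit well-formed tool requests reliably.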
Gerganov has published his Whisper port and the example LLaMA implementations on GitHub.