Retrieval-augmented generation (RAG) promises to help medical AI systems deliver up-to-date and reliable answers. But a new review shows that, so far, RAG rarely works as intended in real-world healthcare settings—and technical, regulatory, and infrastructure hurdles are slowing its adoption.
Large language models have proven powerful across a range of fields and are already in use in many industries. Medicine, however, is a different story. Here, accuracy, timeliness, and transparency are non-negotiable. This is exactly where traditional LLMs run into problems: they can hallucinate facts, lack access to the latest research, and are hard to audit. RAG aims to address these issues. But despite recent technical advances, the technology has yet to make its mark in clinical practice.
A recent overview paper—featuring contributors from the University of Geneva, the University of Tokyo, the Duke-NUS Medical School in Singapore, and several Chinese research institutions—breaks down why this is the case, and what would need to change.
RAG provides current information—at least in theory
At its core, retrieval-augmented generation is simple: instead of relying solely on the model's static knowledge, the system pulls in external sources—like medical guidelines, research papers, or electronic health records—to answer questions. These documents are retrieved, ranked by relevance, and then fed into the language model along with the original query.
In practice, though, things get complicated. According to the researchers, the specialized language of medicine, the wide variety of data formats, and the high stakes for accuracy create unique challenges for every module of a RAG system—from the retriever that gathers external data, to the re-ranker that sorts it by importance, to the generator that creates the final answer.
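To make that retriever, re-ranker, generator pipeline concrete, here is a minimal, self-contained sketch in Python. The toy corpus, the word-overlap scoring, and the `call_llm` placeholder are illustrative assumptions, not components from the review; real systems use dense retrievers (the paper mentions MedCPT) and a dedicated re-ranking model.

```python
# Minimal sketch of the retriever -> re-ranker -> generator loop described
# above. Corpus, scoring, and call_llm are toy stand-ins for illustration.

from dataclasses import dataclass

@dataclass
class Document:
    source: str
    text: str

# Stand-in for external sources: guidelines, papers, health records.
CORPUS = [
    Document("guideline", "First-line therapy for type 2 diabetes is metformin."),
    Document("paper", "GLP-1 agonists reduce cardiovascular risk in diabetes."),
    Document("ehr-note", "Patient reports good tolerance of metformin 500 mg."),
]

def overlap(query: str, text: str) -> float:
    # Toy relevance score: fraction of query words that appear in the text.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

def retrieve(query: str, k: int = 3) -> list[Document]:
    # Retriever: pull candidate documents from the external corpus.
    return sorted(CORPUS, key=lambda d: overlap(query, d.text), reverse=True)[:k]

def rerank(query: str, docs: list[Document], k: int = 2) -> list[Document]:
    # Re-ranker: re-score candidates and keep the most relevant few.
    # A production system would use a separate, finer-grained model here.
    return sorted(docs, key=lambda d: overlap(query, d.text), reverse=True)[:k]

def call_llm(prompt: str) -> str:
    # Generator: placeholder for any LLM client.
    return f"(answer grounded in {prompt.count('[')} retrieved snippets)"

query = "What is first-line therapy for type 2 diabetes?"
docs = rerank(query, retrieve(query))
context = "\n".join(f"[{d.source}] {d.text}" for d in docs)
print(call_llm(f"Use only this context:\n{context}\n\nQuestion: {query}"))
```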
The technology is there, but real-world use is limited
The review highlights a number of RAG systems that have shown promise in research, including medical question-answering tools, systems for diagnosing rare diseases, and automated radiology report generators. RAG approaches are also being tested in genomics and personalized patient communication.
Even so, real-world deployment in hospitals remains rare. The main reason, the authors say, is that these systems are complex, expensive, and often not robust enough for safety-critical environments. Regulatory uncertainty and privacy concerns also make it hard to integrate them into everyday clinical workflows.
Five obstacles blocking clinical adoption
The paper identifies five key challenges:
- Trustworthiness: Faulty sources or poor re-ranking decisions can produce dangerous misinformation.
- Multilingual support: Nearly all current systems work only in English; other languages lack suitable models and datasets.
- Multimodality: Much medical data isn't text at all: it comes as images, time series, or audio. RAG systems that can reliably handle these formats are rare.
- Computing power: Large models like DeepSeek require hundreds of GPUs—unrealistic for most hospitals.
- Data privacy: Handling sensitive patient data with cloud-based LLMs often conflicts with regulations like GDPR or HIPAA.
Some solutions are already being explored: smaller, locally run models, hybrid systems that combine local retrieval with external generation, and domain-specific models like MedCPT. But the researchers point out that these approaches come with trade-offs, such as lower accuracy or new privacy risks.
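As a rough illustration of the hybrid pattern, the sketch below keeps retrieval on local data inside the hospital network and sends only a crudely redacted prompt to an external generator. Everything here (the toy document store, the regex-based redaction, the `external_generate` placeholder) is a hypothetical stand-in, not a vetted de-identification pipeline.

```python
import re

# Toy "on-prem" store; in practice this would be local guidelines or EHR text.
LOCAL_STORE = [
    "MRN-4711: patient on metformin, HbA1c 7.9%",
    "Local guideline: start metformin for type 2 diabetes unless contraindicated",
]

def local_retrieve(query: str, k: int = 2) -> list[str]:
    # Retrieval runs entirely inside the hospital network (toy keyword overlap).
    q = set(query.lower().split())
    return sorted(LOCAL_STORE,
                  key=lambda t: len(q & set(t.lower().split())),
                  reverse=True)[:k]

def redact(text: str) -> str:
    # Naive de-identification for this sketch; real deployments need
    # certified de-identification, not a single regex.
    return re.sub(r"\bMRN-\d+\b", "[PATIENT-ID]", text)

def external_generate(prompt: str) -> str:
    # Placeholder for the only call that crosses the network boundary;
    # swap in a real LLM client here.
    return f"(external model answers from a {len(prompt)}-character redacted prompt)"

context = "\n".join(redact(t) for t in local_retrieve("type 2 diabetes first-line therapy"))
print(external_generate(f"Context:\n{context}\n\nQuestion: What is first-line therapy?"))
```

The trade-off the authors flag is visible even in this toy: the redaction step is exactly where privacy risk concentrates, and keeping generation external still means some derivative of patient data leaves the premises.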
Another barrier was recently identified in a separate study: humans themselves. People who use chatbots to work through medical questions score significantly worse on medical benchmarks than the same chatbots do on their own.