Summary

Researchers from Google and DeepMind are experimenting with a large language model for answering lay medical questions. Med-PaLM generates scientifically sound answers at the level of human experts.

The research team is relying on PaLM, Google's large language model with 540 billion parameters, about three times as many as GPT-3. PaLM outperforms GPT-3 on challenging language and code tasks, according to Google, and forms the language portion of the company's Pathways vision. PaLM stands for Pathways Language Model.

Instruction Prompt Tuning turns PaLM into a medical language model

For the medical variant of PaLM, the research team developed a new prompt method to tune a Flan-PaLM variant to the medical field. Flan-PaLM is a variant of PaLM fine-tuned with instructions for tasks (such as dialogs, FAQs, reasoning), which Google Brain introduced in October.

Instead of fine-tuning PaLM with medical data, which would be more complex, the research team combined soft prompts, learned via prompt tuning on a small amount of medical data, with hard prompts written by humans for specific medical responses. For the latter, the team collaborated with four clinicians from the US and UK.

The research team used the instructions and examples of a panel of qualified clinicians to adapt Flan-PaLM to the medical field. | Image: Google

The researchers named this combination of learned and human-written prompts "Instruction Prompt Tuning." The new method is "data and parameter efficient," the team writes.

To the best of our knowledge, ours is the first published example of learning a soft prompt that is prefixed in front of a full hard prompt containing a mixture of instructions and few-shot examples.

From the paper
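The idea of prefixing a learned soft prompt to a full hard prompt can be sketched in a few lines. The sketch below is purely illustrative: the dimensions, variable names, and random embeddings are assumptions for demonstration, not Med-PaLM's actual implementation, and the real model learns the soft prompt by backpropagation through a frozen PaLM.

```python
import numpy as np

# Hypothetical, tiny dimensions for illustration only; PaLM's real
# embedding size and prompt lengths are far larger.
embed_dim = 8
num_soft_tokens = 4  # length of the learnable soft prompt
rng = np.random.default_rng(0)

# Soft prompt: trainable embedding vectors, tuned on a small amount of
# medical data while the model's own weights stay frozen.
soft_prompt = rng.normal(size=(num_soft_tokens, embed_dim))

# Hard prompt: clinician-written instructions plus few-shot examples,
# shown here already mapped to token embeddings by the frozen model.
hard_prompt_embeddings = rng.normal(size=(12, embed_dim))

# Instruction Prompt Tuning: the soft prompt is prefixed in front of the
# full hard prompt before the sequence enters the transformer.
model_input = np.concatenate([soft_prompt, hard_prompt_embeddings], axis=0)
print(model_input.shape)  # (16, 8)
```

Only the few soft-prompt vectors are trained, which is why the paper calls the method data and parameter efficient.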

The Med-PaLM model resulting from Instruction Prompt Tuning significantly outperforms an unadjusted Flan-PaLM model on medical responses and achieves "encouraging results," according to the research team, but falls short of clinician performance.

Looking at the results, this conclusion is correct, but it also seems like an understatement: Med-PaLM performs equally well as the professionals in almost all tests. The assessment of response quality was also performed by clinicians.

Unlike Flan-PaLM, which is not optimized with medical data, Med-PaLM achieves expert-level results when answering medical questions. | Image: Google

Med-PaLM also produced significantly fewer potentially harmful responses. For Flan-PaLM, 29.7 percent of responses could have resulted in harm to health; for Med-PaLM, it was only 5.9 percent, compared to 5.7 percent for human experts. Again, the medical language model works on par with humans.

Example answers generated by Med-PaLM to medical questions from laypersons. | Image: Google

When evaluated by laypersons, human expert answers were rated as more helpful than those of either language model, but Med-PaLM again performed significantly better than Flan-PaLM.

Evaluated by laypersons, Med-PaLM optimized with medical data gives significantly more helpful answers than Flan-PaLM not optimized with medical data. | Image: Google

Language models could help medical professionals

Med-PaLM's strong performance on medical questions could be an emergent capability of language models, the researchers write in their conclusion. That's because performance scaled with the parameter count of the different PaLM models (8 billion to 540 billion parameters).

However, scaling alone is not sufficient to achieve high response reliability, as demonstrated by the comparatively weak performance of the Flan-PaLM model. This is where the newly introduced Instruction Prompt Tuning comes into play.

According to the research team, 92.6% of Med-PaLM answers were judged to be in accordance with the scientific consensus. Clinician responses scored 92.9%, while Flan-PaLM responses scored only 61.9%. This is an indication that Instruction Prompt Tuning is suitable as an alignment technique to generate scientifically sound responses, the team writes.

The Med-PaLM results demonstrate that with instruction prompt tuning we have a data and parameter-efficient alignment technique useful for improving factors related to accuracy, factuality, consistency, safety, harm, and bias, helping close the gap with clinical experts and bringing these models closer to real-world clinical applications.

From the paper

The rise of foundational AI models is a "significant opportunity" to rethink how medical AI is developed and make it "easier, safer and more equitable to use," the researchers write. They see their work as an impetus for further exchange.


Complementing Med-PaLM, the research team is introducing MultiMedQA, a benchmark that combines six existing open-ended question-answering datasets covering medical exams, research, and consumer inquiries, and HealthSearchQA, a new free-text dataset of medical questions searched online.

  • Med-PaLM is a large language model optimized with medical data to answer lay medical questions.
  • It generates answers that are qualitatively equivalent to those of clinical professionals.
  • The team also publishes MultiMedQA, a benchmark for medical AI language models.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.