Google has unveiled DataGemma, a set of open models designed to improve the accuracy and reliability of large language models by grounding them in real-world data.


According to Google, DataGemma was developed in response to the persistent challenge of hallucinations in LLMs, where models sometimes present inaccurate information in a highly convincing way.

DataGemma uses Google's Data Commons, a publicly available knowledge graph that contains more than 240 billion global data points from verified sources such as the United Nations, the World Health Organization, and various statistical agencies.
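To give a sense of what a lookup against this knowledge graph looks like, here is a minimal sketch using the public datacommons Python client; the get_stat_value helper and the "geoId/06" identifier for California reflect that client's documented API, but names may differ in newer versions, so treat it as an illustration rather than a recipe.

```python
# Minimal sketch: querying Data Commons for a single statistic.
# Assumes the `datacommons` Python client (pip install datacommons);
# helper names and identifiers may differ in newer versions of the API.
import datacommons as dc

# "geoId/06" is the Data Commons identifier for California,
# "Count_Person" the statistical variable for total population.
population = dc.get_stat_value("geoId/06", "Count_Person")
print(f"Population of California: {population}")
```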


RIG recognizes and checks statistics

The models come in two variants, each built around a well-known approach: Retrieval Interleaved Generation (RIG) and Retrieval Augmented Generation (RAG).


RIG uses a fine-tuned Gemma 2 model to identify statistics within its responses and pairs each one with a query to Data Commons, allowing the model to check its output against a trusted source.

"For example, instead of saying "The population of California is 39 million," the model would output "The population of California is [DC(What is the population of California?) → '39 million']," Google explains.


RAG retrieves relevant information

In the RAG method, a Gemma model, also fine-tuned for this use case, first analyzes the user's question and converts it into queries that Data Commons can understand. The retrieved information is then used to enrich the original question before a larger language model, such as Gemini 1.5 Pro in Google's setup, generates the final answer.
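The following sketch outlines that three-step flow under simplifying assumptions; generate_dc_queries, fetch_tables, and answer_with_gemini are placeholders standing in for the fine-tuned Gemma model, the Data Commons API, and Gemini 1.5 Pro, not real DataGemma or Gemini APIs.

```python
# Rough sketch of the RAG flow described above; all three helpers are
# placeholders, not real DataGemma or Gemini APIs.

def generate_dc_queries(question: str) -> list[str]:
    """Fine-tuned Gemma step: rewrite the user question as Data Commons
    queries (stubbed here)."""
    return ["Population of California over time"]

def fetch_tables(queries: list[str]) -> str:
    """Data Commons step: retrieve the matching statistical tables
    (stubbed here as a small CSV string)."""
    return "year,population\n2010,37.3M\n2020,39.5M"

def answer_with_gemini(prompt: str) -> str:
    """Long-context LLM step: in Google's setup this is Gemini 1.5 Pro
    (stubbed here)."""
    return "California's population grew from 37.3M in 2010 to 39.5M in 2020."

def rag_answer(question: str) -> str:
    tables = fetch_tables(generate_dc_queries(question))
    # Enrich the original question with the retrieved tables before
    # handing everything to the larger model.
    prompt = (
        "Answer the question using only the tables below.\n\n"
        f"{tables}\n\nQuestion: {question}"
    )
    return answer_with_gemini(prompt)

print(rag_answer("How has California's population changed?"))
```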

The challenge with this method is that the data returned by Data Commons can include a large number of tables spanning several years. In Google's test, entries were on average 38,000 tokens long, with a maximum of 348,000 tokens.

According to Google, RAG is only feasible because of Gemini 1.5 Pro's long context window, which allows the team to augment the user query with the extensive data from Data Commons. Gemini 1.5 Pro's context window holds up to two million tokens; the next largest among commercially available language models is Claude 3's 200,000.



Both approaches have advantages and disadvantages

Both approaches have their strengths and weaknesses. According to the Google researchers, RIG works effectively in any context, but it does not let the LLM draw on Data Commons data added after fine-tuning. In addition, fine-tuning requires specific datasets tailored to the task at hand.

RAG, on the other hand, automatically benefits from ongoing model development, but depending on the prompt it can sometimes lead to a less intuitive user experience.

Google has made the models available for download on Hugging Face and Kaggle, along with quickstart notebooks for both the RIG and RAG approaches.
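A minimal sketch of loading one of the checkpoints with the Hugging Face transformers library is shown below; the repository id google/datagemma-rag-27b-it is an assumption based on the announcement (check the model cards for the exact names), and the 27B weights require substantial GPU memory or quantization to run.

```python
# Sketch: loading a DataGemma checkpoint with Hugging Face transformers.
# The repository id below is an assumption based on the announcement;
# check the model cards on Hugging Face. The 27B weights need a large
# GPU (or quantization) to run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/datagemma-rag-27b-it"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "What is the population of California?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```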

Summary
  • Google has introduced DataGemma, a set of open models for improving the accuracy of language models by anchoring them in real-world data from the Data Commons knowledge graph.
  • DataGemma uses two approaches: Retrieval Interleaved Generation (RIG) checks statistics against the Data Commons, while Retrieval Augmented Generation (RAG) retrieves relevant information and incorporates it into response generation.
  • Both have advantages and disadvantages: RIG works effectively in all contexts, but cannot learn new data. RAG benefits from new model developments, but can lead to less intuitive user experiences. Google makes the models available for download on Hugging Face and Kaggle.
Jonathan works as a freelance tech journalist for THE DECODER, focusing on AI tools and how GenAI can be used in everyday work.