
DataGemma Release

Making AI more reliable and trustworthy

Google has introduced DataGemma - the world's first open models designed to reduce hallucinations in large language models (LLMs) by grounding them in real-world statistical data from Google's Data Commons.

Data Commons is a publicly available knowledge graph containing over 240 billion data points from trusted organizations such as the UN, WHO, CDC, and national census bureaus. DataGemma connects this vast resource to Gemma, Google's family of lightweight open models built from the same research and technology as Gemini.

The Google team used two approaches to enhance LLM factuality and reasoning:

  1. RIG (Retrieval-Interleaved Generation): the model proactively queries Data Commons and fact-checks its own statistics while generating a response. RIG improved factual accuracy from 5-17% for base models to about 58% for fine-tuned models.
  2. RAG (Retrieval-Augmented Generation): relevant Data Commons statistics are retrieved and added to the prompt before generation, leveraging Gemini 1.5 Pro's long context window. With RAG, 98-99% of the statistical claims drawn from Data Commons were accurate.
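The two approaches differ mainly in *when* retrieval happens: RIG checks statistics as they are generated, while RAG fetches them up front. Here is a minimal sketch of that control-flow difference in Python, using a toy dictionary as a stand-in for Data Commons; all function names and the interleaving format are illustrative assumptions, not the DataGemma API.

```python
# Toy stand-in for the Data Commons knowledge graph.
STATS = {"population of Nigeria": "218 million (2022)"}

def query_data_commons(query):
    """Illustrative stand-in for a Data Commons statistics lookup."""
    return STATS.get(query)

def rig_generate(draft_tokens):
    """RIG sketch: while generating, the model emits each statistic
    together with the query it corresponds to; every such claim is
    checked against Data Commons and replaced inline if a trusted
    value exists."""
    out = []
    for token in draft_tokens:
        if isinstance(token, tuple):           # (model_value, dc_query)
            value, query = token
            trusted = query_data_commons(query)
            out.append(trusted if trusted else value)
        else:
            out.append(token)
    return " ".join(out)

def rag_generate(question, llm):
    """RAG sketch: retrieve relevant statistics first, then prepend
    them to the prompt so generation is conditioned on grounded data."""
    context = query_data_commons(question) or ""
    return llm(f"Context: {context}\nQuestion: {question}")
```

For example, if the model drafts "Nigeria has 200 million people" but tags the figure with its query, `rig_generate(["Nigeria", "has", ("200 million", "population of Nigeria"), "people."])` substitutes the grounded value and returns "Nigeria has 218 million (2022) people."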

Google states that these preliminary results show measurable improvements in the accuracy of language models handling numerical facts, which could reduce hallucinations across many use cases. The company plans to refine both methodologies and eventually integrate them into the Gemma and Gemini model families through a phased, limited-access rollout.

Researchers and developers can access DataGemma through quickstart notebooks for both RIG and RAG approaches.
