DataGemma: Grounding LLMs Against Hallucinations

aditi3807991 20 Sep, 2024
13 min read

Introduction

Large Language Models (LLMs) are rapidly transforming industries. Today they power everything from personalized customer service in banking to real-time language translation in global communication. They can answer questions in natural language, summarize information, write essays, generate code, and much more, which makes them invaluable tools. Despite these advantages, they suffer from a critical flaw known as “hallucination”: the model generates information that appears correct and realistic but is partially or entirely false, made up by the model with no grounding in real-world data. To tackle this, Google has developed DataGemma, an open model that connects LLMs with real-world data and fact-checks their responses against trusted sources using Google’s Data Commons.

Learning Outcomes

  • Understand the basics of Large Language Models (LLMs) and their applications.
  • Explore the causes and types of hallucinations in LLMs.
  • Learn how Google’s DataGemma tackles LLM hallucinations using real-world data.
  • Gain insights into advanced techniques like Retrieval-Interleaved Generation (RIG) and Retrieval Augmented Generation (RAG).
  • Discover how Google’s Data Commons improves LLM factual accuracy.

This article was published as a part of the Data Science Blogathon.

Understanding Large Language Models

Large Language Models are foundation models trained on huge amounts of textual data, with parameter counts ranging from millions to billions, that can understand and generate natural language. They are built on the transformer architecture. An LLM can be fine-tuned for specific tasks in specific domains using customized datasets. For example, a model like BERT can be fine-tuned on cybersecurity corpora to automate threat intelligence. Some popular LLMs are GPT-4 by OpenAI, BERT and Gemini by Google, LLaMA by Meta, and Claude by Anthropic.

Comparison of Gemma, Gemini and BERT

| Gemma | Gemini | BERT |
|---|---|---|
| Lightweight model for developers | Larger and more powerful, conversational AI | Pre-trained model for NLP tasks |
| Ideal for applications with resource constraints, like mobile phones and edge computing | Ideal for complex tasks with no resource constraints, like large-scale data analysis and complex AI applications | Ideal for tasks like text classification, question answering, and sentiment analysis |
| Easy to deploy in limited-resource environments | Often deployed in cloud environments or data centers with abundant resources | Deployed on-premise or in the cloud, but larger versions (like BERT-Large) require significant computational resources |
| Requires fewer computational resources | Often requires more computational resources | Smaller models like BERT-Base can run on moderate hardware, while larger models like BERT-Large may need more resources, but still less than Gemini |

Understanding Architecture of Gemma

Gemma itself is a text-generation model. The DataGemma variants described later pair it with retrieval from external data sources, so that the system can produce accurate, coherent responses for various AI-driven applications.

Gemma is based on the transformer decoder architecture.

Gemma and Gemma 2 (the latest version at the time of writing, released in 2024) belong to Google’s Gemma family of LLMs. They can be fine-tuned for customized tasks. For example, CodeGemma models are Gemma models fine-tuned for code completion.

What are Hallucinations in the Context of LLMs?

Hallucinations in LLMs are instances where the model confidently generates output that is inaccurate, inconsistent, or made up, yet appears believable. For example, in one court case, two lawyers cited sources provided by ChatGPT that turned out to be false.

AI hallucinations can be of three types:

  • Input-conflicting hallucinations: The model generates an output that deviates from the information provided by the user in the input.
  • Context-conflicting hallucinations: The model generates an output that contradicts its own previously generated outputs.
  • Fact-conflicting hallucinations: The model generates false or inaccurate output that contradicts real-world knowledge or facts.

What Causes Hallucinations? 

  • Limited training data: When a model has not been trained thoroughly, or has been trained on limited data, it may encounter a prompt unlike anything in its training data. Even though it does not fully understand the new prompt, it may still produce an answer based on its existing training data, leading to inaccuracies.
  • Overfitting: When a model is too complex for the amount of data (for example, when too many features are provided), it tries to fit every training data point without learning the underlying patterns. It may reach close to 100% accuracy on the training data, but it will not generalize well to new data.

Hallucinated LLM content can be harmful if used without fact-checking. In applications where factual accuracy is essential, like medical advice or legal guidance, hallucinations can have serious consequences. Because hallucinations are delivered as confidently as correct answers, users can find them difficult to recognise. And as reliance on AI for accurate information rises, hallucinations can reduce trust in AI systems, making it harder for LLMs to be accepted in high-stakes domains.

Model developers therefore need to tackle this problem and ensure that, where accuracy and facts matter, the LLM generates correct, factual output and avoids spreading misinformation. One such approach to tackling AI hallucinations is Google’s DataGemma.

What is DataGemma?

DataGemma is an open model developed by Google to connect LLMs with trustworthy, factual, real-world data sourced from Google’s Data Commons.

Google Data Commons is an open repository that combines a vast number of public datasets into a unified format, making them easier to access and use. It draws on a variety of sources, including government publications, research organizations, and global databases. The primary purpose of Data Commons is to provide a common framework for these datasets, allowing users to query and analyze structured real-world data across numerous domains without costly data cleaning or integration efforts.

Key Features of Data Commons

  • It includes data on a variety of topics such as demographics, economics, environment, and healthcare, sourced from places like the U.S. Census Bureau, World Bank, NOAA, and more.
  • The data is organized into a standardized schema, so users can easily query datasets without needing to deal with the complexities of different data formats and structures.
  • Developers can access Data Commons through APIs.
  • It’s a public service that is free to use, designed to make high-quality, reliable data accessible to everyone.
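
To make the API access point above more concrete, here is a minimal sketch that only builds a request URL and does not send it. The endpoint path, the parameter names, and the identifiers `country/IND` and `Count_Person` are assumptions based on the Data Commons REST API (v2) as commonly documented, and `build_observation_url` is a hypothetical helper, so check the official API reference before relying on any of it.

```python
from urllib.parse import urlencode

# Assumed endpoint: the observation route of the Data Commons REST API (v2).
BASE_URL = "https://api.datacommons.org/v2/observation"


def build_observation_url(api_key, entity_dcid, variable_dcid, date="LATEST"):
    """Build (but do not send) a request for one variable of one entity."""
    params = [
        ("key", api_key),
        ("date", date),
        ("entity.dcids", entity_dcid),      # which place, e.g. a country
        ("variable.dcids", variable_dcid),  # which statistic to look up
        ("select", "entity"),
        ("select", "variable"),
        ("select", "value"),
        ("select", "date"),
    ]
    return BASE_URL + "?" + urlencode(params)


# "YOUR_API_KEY" is a placeholder; the identifiers are illustrative.
url = build_observation_url("YOUR_API_KEY", "country/IND", "Count_Person")
print(url)
```

Because every dataset shares the same schema, the same URL shape would work for a different place or statistic by changing only the two identifiers.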

Importance of Data Commons

  • Researchers can use the Data Commons to quickly gather and analyze large, structured datasets without needing to source and clean the data manually.
  • Large Language Models (LLMs), like Google’s Gemma, can use Data Commons to reference real-world data, reducing hallucinations and improving factual accuracy in their outputs.

Link: Build your own Data Commons – Data Commons

RIG: A Hybrid Approach for Minimizing LLM Hallucinations

Retrieval-Interleaved Generation (RIG) is an advanced technique in natural language processing (NLP) that combines retrieval-based and generation-based methods to improve the quality and relevance of responses.

Here’s a brief explanation of how RIG works: 

  • Retrieval-Based Methods: These methods involve searching a large database of pre-existing responses or documents to find the most relevant information. This approach ensures that the responses are accurate and grounded in real data.
  • Generation-Based Methods: These methods use models to generate responses from scratch based on the input. This allows for more flexible and creative responses but can sometimes lead to inaccuracies or hallucinations.
  • Interleaving: By interleaving or combining retrieval and generation techniques, RIG utilizes the strengths of both approaches. The system retrieves relevant information and then uses a generative model to refine and expand upon it, ensuring accuracy and creativity.

This is useful in applications where high-quality, contextually relevant responses are crucial, such as in conversational AI, customer support, and content creation. 

In DataGemma, Gemma 2 is fine-tuned to recognize when to fetch accurate information while generating an output. It then replaces the numbers in its generated output with more precise values from Data Commons. In effect, the model double-checks its output against a more trusted source.

How Is RIG Used in DataGemma?

In DataGemma, Retrieval-Interleaved Generation (RIG) is leveraged to enhance the accuracy and relevance of outputs by combining the strengths of both retrieval and generative models, ensuring that generated content is grounded in reliable data from trusted sources like Data Commons.

  • First, the user submits a query to the LLM. In our case, the LLM is DataGemma, which is based on the Gemma 2 model with 27B parameters, fine-tuned for RIG.
  • The DataGemma model generates a draft response in which each statistic is paired with a natural-language query. The purpose of these queries is to retrieve the relevant data through Data Commons’ natural language interface.
  • Data Commons is queried, and the required data is retrieved.
  • The final response is generated and shown to the user. It includes the data, the source information along with its link, and some metadata, which replace the potentially inaccurate numbers in the original response.
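
The steps above can be sketched in a few lines of plain Python. This is a toy illustration and not DataGemma's actual implementation: the `[DC("...") --> "..."]` annotation format, the `fetch` and `ground` helpers, and the in-memory `FAKE_STORE` are all assumptions made for the example, and the values describe a fictional place.

```python
import re

# Hypothetical annotation format: [DC("question") --> "model's own guess"]
ANNOTATION = re.compile(r'\[DC\("([^"]+)"\) --> "([^"]*)"\]')

# Stand-in for Data Commons: placeholder values for a fictional place.
FAKE_STORE = {
    "what is the population of Examplestan?": ("1,200", "Example Census 2020"),
}


def fetch(question):
    """Return (value, source) for a question, or None if nothing is found."""
    return FAKE_STORE.get(question)


def ground(draft):
    """Replace each annotated statistic with the retrieved value if available."""
    def substitute(match):
        question, guess = match.group(1), match.group(2)
        hit = fetch(question)
        if hit is None:
            return guess  # retrieval failed: keep the model's own number
        value, source = hit
        return f"{value} [source: {source}]"

    return ANNOTATION.sub(substitute, draft)


draft = (
    'Examplestan has about '
    '[DC("what is the population of Examplestan?") --> "1,000"] residents '
    'and [DC("how many lakes are in Examplestan?") --> "12"] lakes.'
)
grounded = ground(draft)
print(grounded)
```

The first number is swapped for the retrieved value and gains a source note, while the second stays as the model wrote it because the toy store has no answer for that question.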

Step by Step Procedure on Google Colab

Let us now implement RIG for minimizing hallucination.

Pre-requisites:

  • A100 GPU
  • High-RAM runtime 
  • Hugging Face Token

Step1: Log in to your Hugging Face account and create a new token

Log in to your Hugging Face account, create a new access token, and copy it.

Step2: DataCommons API Key

Create a new app in Data Commons to obtain an API key.

Step3: Enable Data Commons NL API

Enable the Data Commons NL API for your key. Then go to the Secrets section of your Colab notebook, create the following secrets, and enable notebook access for each:

  • HF_TOKEN with your Hugging Face token as its value
  • DC_API_KEY with your Data Commons API key as its value

Step4: Install Required Libraries

Let us install the required libraries and load the fine-tuned model.

# Install the required libraries
!pip install -q git+https://github.com/datacommonsorg/llm-tools
!pip install -q bitsandbytes accelerate

# Load the fine-tuned Gemma 2 27B model

import torch

import data_gemma as dg

from google.colab import userdata
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Initialize Data Commons API client
DC_API_KEY = userdata.get('DC_API_KEY')
dc = dg.DataCommons(api_key=DC_API_KEY)


# Get finetuned Gemma2 model from HuggingFace
HF_TOKEN = userdata.get('HF_TOKEN')

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_compute_dtype=torch.bfloat16
)

model_name = 'google/datagemma-rig-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
datagemma_model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=nf4_config,
                                             torch_dtype=torch.bfloat16,
                                             token=HF_TOKEN)

# Build the LLM Model stub to use in RIG flow
datagemma_model_wrapper = dg.HFBasic(datagemma_model, tokenizer)
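
A quick back-of-the-envelope calculation suggests why the code above loads the model in 4-bit form and why a large GPU is listed as a prerequisite. The sketch below counts the weights only, assuming 27 billion parameters and 1 GB = 1e9 bytes; it ignores quantization constants, activations, and the generation cache, so real usage will be higher. `weight_memory_gb` is a hypothetical helper written for this example.

```python
PARAMS = 27e9  # parameter count stated for the DataGemma models


def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(PARAMS, 16)  # unquantized bfloat16 weights
nf4_gb = weight_memory_gb(PARAMS, 4)    # 4-bit NF4 weights

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit NF4 weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the quantized weights need roughly a quarter of the memory of the bfloat16 weights, which is what makes a single-GPU Colab runtime plausible.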

Step5: Pick or Enter a Query

In this step, users can either select a pre-defined query or input a custom query, enabling the system to retrieve relevant information from the data sources for further processing.
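
The code in the next step reads a variable named QUERY, so it has to be defined first. A minimal stand-in is to assign the string yourself; the question below is only a hypothetical example, not one taken from the notebook.

```python
# Hypothetical example question; replace it with your own.
QUERY = "How has the unemployment rate in California changed over time?"

# Normalize stray whitespace so the prompt sent to the model is clean.
QUERY = " ".join(QUERY.split())
print(QUERY)
```

Questions that ask for statistics are a natural fit, since the point of the exercise is to see numbers grounded in Data Commons.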


Step6: Run the RIG technique and Generate Output

In this step, the RIG technique is executed, combining retrieval and generation methods to produce a precise and contextually relevant output based on the input query.

from IPython.display import Markdown
import textwrap

def display_chat(prompt, text):
  formatted_prompt = "<font size='+1' color='brown'>🙋‍♂️<blockquote>" + prompt + "</blockquote></font>"
  text = text.replace('•', '  *')
  text = textwrap.indent(text, '> ', predicate=lambda _: True)
  formatted_text = "<font size='+1' color='teal'>🤖\n\n" + text + "\n</font>"
  return Markdown(formatted_prompt+formatted_text)

def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))


ans = dg.RIGFlow(llm=datagemma_model_wrapper, data_fetcher=dc, verbose=False).query(query=QUERY)
Markdown(textwrap.indent(ans.answer(), '> ', predicate=lambda _: True))


display_chat(QUERY, ans.answer())

Output: (for a different query)

 Output for Query 2

Conclusion: Gemma 2 generates only a numerical value, while DataGemma generates the numerical value along with its source information, source links, some metadata, and a conclusion for the query.

Source: Google Colab notebook provided by Google

Retrieval Augmented Generation for Minimizing LLM Hallucinations

Retrieval Augmented Generation (RAG) is an approach in natural language processing (NLP) and large language models (LLMs) that improves the factual accuracy and relevance of generated content by letting the model access external knowledge sources during generation. In DataGemma, it retrieves relevant information from Data Commons before the LLM generates text, giving the model a factual foundation for its response.

Here’s a brief explanation of how RAG works: 

  • Retrieval: When the user enters a query, the model receives it and then extracts the relevant data from its knowledge base or external sources.
  • Augmentation: This external information is then used to “augment” (or enhance) the input context for the language model, helping it generate more contextually relevant responses.
  • Generation: The LLM generates a response based on both the original query and the retrieved information.
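
The three steps above can be sketched with plain Python. This is a toy illustration under stated assumptions: the word-overlap ranking stands in for a real retriever, `generate` is a placeholder rather than a real LLM call, and the documents are placeholder facts about a fictional place. All function names are hypothetical.

```python
import re

# Toy knowledge base; the facts are placeholders about a fictional place.
DOCUMENTS = [
    "Examplestan census 2020: population 1,200.",
    "Examplestan has 12 lakes according to the Example Survey.",
    "The capital of Examplestan is Sampleville.",
]

STOPWORDS = {"what", "is", "the", "of", "a", "to", "in", "has"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(query, documents, top_k=1):
    """Retrieval: rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents, key=lambda d: len(query_words & tokenize(d)), reverse=True
    )
    return ranked[:top_k]


def augment(query, passages):
    """Augmentation: put the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"


def generate(prompt):
    """Generation: placeholder for an LLM call that would receive the prompt."""
    return f"(model response to a {len(prompt)}-character prompt)"


query = "What is the population of Examplestan?"
passages = retrieve(query, DOCUMENTS)
prompt = augment(query, passages)
print(prompt)
print(generate(prompt))
```

The key design point is that the model never sees the question alone: whatever it generates is conditioned on the retrieved passage placed in the prompt.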

How Is RAG Used in DataGemma?

In DataGemma, Retrieval-Augmented Generation (RAG) is employed to enhance response accuracy by retrieving relevant data from external sources and then generating content that combines this retrieved knowledge with AI-generated insights, ensuring high-quality and contextually relevant outputs.

Here’s how RAG works:

  • First, the user submits a query to the LLM. In our case, the LLM is DataGemma, which is based on the Gemma 2 model with 27B parameters, fine-tuned for the RAG task.
  • After analyzing the input query, the DataGemma model generates a response in the form of natural-language queries. The purpose of these is to retrieve relevant data through Data Commons’ natural language interface.
  • Data Commons is queried and the required information, including data tables, is retrieved.
  • The retrieved information is added to the original user query, creating an enhanced or augmented prompt.
  • A larger LLM (in our case, Gemini 1.5 Pro) uses this augmented prompt, including the retrieved data, to generate a better, more accurate and factual response.
  • The final response is shown to the user. It includes data tables, the source information along with its link, and some metadata.

Step by Step Procedure on Google Colab

We will now look into the step-by-step procedure of RAG for minimizing hallucinations.

Pre-requisites:

  • A100 GPU
  • High-RAM runtime 
  • Hugging Face Token
  • Data Commons API Token
  • Gemini 1.5 Pro API Key

Step1: Create Gemini API Key

Go to Google AI Studio and create a Gemini API key.

Step2: Enable Notebook Access

Go to the Secrets section of your Google Colab notebook and enter your Hugging Face token, Data Commons API key, and Gemini 1.5 Pro API key as the secrets HF_TOKEN, DC_API_KEY, and GEMINI_API_KEY. Enable notebook access for each.

Step3: Install the Required Libraries

In this step, you’ll install the libraries needed to implement the RAG technique and load the fine-tuned DataGemma model.

#install libraries
!pip install -q git+https://github.com/datacommonsorg/llm-tools
!pip install -q bitsandbytes accelerate

#load fine-tuned Gemma2 27B model
import torch

import data_gemma as dg

from google.colab import userdata
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Initialize Data Commons API client
DC_API_KEY = userdata.get('DC_API_KEY')
dc = dg.DataCommons(api_key=DC_API_KEY)

# Get Gemini 1.5 Pro model
GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
gemini_model = dg.GoogleAIStudio(model='gemini-1.5-pro', api_keys=[GEMINI_API_KEY])


# Get finetuned Gemma2 model from HuggingFace
HF_TOKEN = userdata.get('HF_TOKEN')

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_compute_dtype=torch.bfloat16
)

model_name = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
datagemma_model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=nf4_config,
                                             torch_dtype=torch.bfloat16,
                                             token=HF_TOKEN)

# Build the LLM Model stub to use in RAG flow
datagemma_model_wrapper = dg.HFBasic(datagemma_model, tokenizer)

Step4: Pick or Create Your Own Query

You’ll select or create a custom query that will serve as the input for the RAG technique to retrieve data and generate the desired output.

Step5: Run RAG and Generate the Output

Now you’ll execute the RAG system to retrieve relevant data and generate the final output based on the query you provided.

from IPython.display import Markdown
import textwrap

def display_chat(prompt, text):
  formatted_prompt = "<font size='+1' color='brown'>🙋‍♂️<blockquote>" + prompt + "</blockquote></font>"
  text = text.replace('•', '  *')
  text = textwrap.indent(text, '> ', predicate=lambda _: True)
  formatted_text = "<font size='+1' color='teal'>🤖\n\n" + text + "\n</font>"
  return Markdown(formatted_prompt+formatted_text)

def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

ans = dg.RAGFlow(llm_question=datagemma_model_wrapper, llm_answer=gemini_model, data_fetcher=dc).query(query=QUERY)
Markdown(textwrap.indent(ans.answer(), '> ', predicate=lambda _: True))


display_chat(QUERY, ans.answer())

Output: 

Query output generated with relevant data tables

Conclusion: When a query is asked, the data tables relevant to it are retrieved, and this data is then used to compose a final response with meaningful information and insights. The output contains the query response along with source links, tables, and a conclusion.

Link: Data Gemma RAG

Why is DataGemma Important?

DataGemma grounds LLM outputs in real-world data so that the model generates fact-based responses. By fact-checking the model’s responses against verified data from Google’s Data Commons, DataGemma helps reduce the number of incorrect or fabricated answers. Using the RIG and RAG approaches, researchers at Google have observed a significant improvement in the accuracy of the model’s output, especially for queries that require numerical answers.

They have also observed that users prefer the output generated by RIG and RAG over the baseline output. This approach can reduce AI hallucinations and, with them, the generation of misinformation. Also, since Google has released this Gemma variant as an open model, developers and researchers can explore the approach and enhance it further toward the common goal of making LLMs more reliable and trustworthy.

Conclusion

LLMs have become vital tools across industries, but their tendency to “hallucinate” (to generate convincing but incorrect information) poses a significant issue. Google’s DataGemma, combined with the vast real-world data of Google’s Data Commons, offers a possible solution. Its techniques improve accuracy, particularly for numerical information, by basing LLM outputs on validated statistical data, which in turn decreases misinformation. Early results show that this strategy considerably increases the credibility of AI responses, with users preferring the more factual outputs. Because DataGemma is an open model, researchers and developers can use and improve it, bringing LLMs closer to being reliable tools for real-world applications. Collaboration can help reduce hallucinations and increase trustworthiness.

Frequently Asked Questions

Q1. What is a foundation model?

A. A foundation model is a large machine learning model trained on huge amounts of diverse data, enabling it to generalize across a wide range of tasks. LLMs are a type of foundation model trained on vast amounts of textual data.

Q2. What is AI hallucination?

A. AI hallucination refers to the phenomenon where an AI model generates information that seems accurate but is incorrect or fabricated. The model produces responses that lack grounding in real-world data or facts.

Q3. Why do LLMs hallucinate?

A. LLMs hallucinate because they generate outputs based on patterns in the data they were trained on. When they lack enough context or relevant data to answer a query, they may fabricate plausible-sounding information, based on similar data in their existing knowledge, instead of admitting uncertainty.

Q4. What is Google Gemma?

A. Google Gemma is a lightweight LLM from Google, built on the research behind Google Gemini. DataGemma is a variant of Gemma: an open model developed to connect LLMs with real-world statistical data from Google’s Data Commons.

Q5. What is the difference between RIG and RAG?

A. RIG integrates real-world statistical data directly into the model’s output by checking generated responses against external data sources such as Google Data Commons. In other words, the response is generated first and then fact-checked against external sources. RAG, by contrast, first retrieves relevant information from external databases or knowledge sources and then generates the response based on that information.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Hello data enthusiasts! I am V Aditi, a rising and dedicated data science and artificial intelligence student embarking on a journey of exploration and learning in the world of data and machines. Join me as I navigate through the fascinating world of data science, unraveling its mysteries and sharing insights along the way! 📊✨
