Large Language Models are rapidly transforming industries, powering everything from personalized customer service in banking to real-time translation in global communication. They can answer questions in natural language, summarize information, write essays, generate code, and much more, making them invaluable tools today. Despite these advantages, they suffer from a critical flaw known as “hallucination”: instances when the model generates information that appears correct and realistic but is partially or entirely false, fabricated by the model and lacking any grounding in real-world data. To tackle this, Google has developed DataGemma, an open model that connects LLMs with real-world data and fact-checks their responses against trusted sources using Google’s Data Commons.
Large Language Models are foundation models, trained on huge amounts of textual data with parameters ranging from millions to billions, that can understand and generate natural language. They are built on the transformer architecture, which allows them to process and generate text. An LLM can be fine-tuned for specific tasks in specific domains using customized datasets; for example, a model like BERT can be fine-tuned on cybersecurity corpora to automate threat intelligence (a minimal fine-tuning sketch follows the comparison table below). Some popular LLMs are GPT-4 by OpenAI, BERT and Gemini by Google, LLaMA by Meta, and Claude by Anthropic.
| GEMMA | GEMINI | BERT |
|---|---|---|
| Lightweight model for developers | Larger and more powerful conversational AI | Pre-trained model for NLP tasks |
| Ideal for applications with resource constraints, like mobile phones and edge computing | Ideal for complex tasks with no resource constraints, like large-scale data analysis and complex AI applications | Ideal for tasks like text classification, question answering, and sentiment analysis |
| Easy to deploy in resource-limited environments | Often deployed in cloud environments or data centers with abundant resources | Deployed on-premise or in cloud environments, but larger versions (like BERT-Large) require significant computational resources |
| Requires fewer computational resources | Often requires more computational resources | Smaller models like BERT-Base can run on moderate hardware, while larger models like BERT-Large need more resources, though still less than Gemini |
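To make the fine-tuning idea concrete, here is a minimal sketch of fine-tuning BERT for a binary text-classification task with Hugging Face transformers. The tiny dataset, labels, and output directory are illustrative placeholders, not from any real corpus, and a real fine-tuning run would of course use far more data.

#minimal fine-tuning sketch: BERT for binary text classification (toy data)
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["Suspicious login attempt detected", "Routine system update completed"]
labels = [1, 0]  # 1 = potential threat, 0 = benign (illustrative labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned-demo",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()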
The architecture of Gemma is designed to seamlessly integrate advanced retrieval and generation techniques, allowing the system to intelligently access external data sources while producing accurate, coherent responses, making it highly effective for various AI-driven applications.
Gemma is based on a decoder-only transformer architecture.
Gemma and Gemma 2 (the latest version, released in 2024) belong to Google’s Gemma family of LLMs. They can be fine-tuned for custom tasks; for example, CodeGemma models are Gemma models fine-tuned for code completion.
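As a quick illustration of using a Gemma-family model, here is a minimal text-generation sketch with Hugging Face transformers. The checkpoint ID and prompt are illustrative choices; Gemma checkpoints are gated on Hugging Face, so a valid token and accepted license are assumed.

#minimal sketch: text generation with a Gemma 2 instruction-tuned checkpoint
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2-2b-it"  # illustrative Gemma 2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain LLM hallucination in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))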
Hallucinations in LLMs are instances where the model confidently generates output that is inaccurate, inconsistent, or entirely made up, yet appears believable. For example, in a court case, two lawyers cited sources provided by ChatGPT that turned out to be false.
As you can see, hallucinated LLM content can be harmful if used without fact-checking. In applications where factual accuracy is essential and misinformation cannot be tolerated, such as medical advice or legal guidance, hallucinations can have serious consequences. Because hallucinations are delivered as confidently as correct answers, they can be difficult for users to recognize. And as reliance on AI for accurate information grows, hallucinations can erode trust in AI systems, making it harder for LLMs to be accepted in high-stakes domains.
Thus, model developers need to tackle this problem and ensure that in cases involving accuracy and facts, the LLM should generate correct, factual output to avoid the spread of misinformation. One such approach to tackle AI Hallucinations has been developed by Google in the form of DataGemma.
DataGemma is an open model developed by Google to connect LLMs with trustworthy, factual, real-world data sourced from Google’s Data Commons.
Google Data Commons is an open repository that combines a vast amount of public data into a unified format, making it easier to access and use. It draws on a variety of sources, including government publications, research organizations, and global databases. The primary purpose of Data Commons is to provide a common framework for these datasets, allowing users to query and analyze structured real-world data across numerous domains without costly data cleaning or integration efforts.
Link: Build your own Data Commons – Data Commons
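For a sense of how structured data can be pulled from Data Commons programmatically, here is a minimal sketch using the `datacommons` Python client (installed with `pip install datacommons`). The place DCID and statistical variable below are illustrative examples, not values used later in this article.

#minimal sketch: fetching a statistic directly from Data Commons
import datacommons as dc_client

# Latest population count for California (place dcid "geoId/06")
population = dc_client.get_stat_value("geoId/06", "Count_Person")
print(f"Population of California (latest reported value): {population}")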
Retrieval-Interleaved Generation (RIG) is an advanced technique in natural language processing (NLP) that combines retrieval-based and generation-based methods to improve the quality and relevance of responses.
Here’s a brief explanation of how RIG works: while generating a response, the fine-tuned model flags the statistics it produces and emits natural-language queries for them to a trusted source (Data Commons); the retrieved values are then interleaved into the output in place of the model’s own numbers.
This is useful in applications where high-quality, contextually relevant responses are crucial, such as in conversational AI, customer support, and content creation.
In DataGemma, Gemma 2 is fine-tuned to recognize when accurate information needs to be retrieved while generating an output. It then replaces the numbers it generated with more precise figures from Data Commons; essentially, the model double-checks its output against a more trusted source.
In DataGemma, Retrieval-Interleaved Generation (RIG) is leveraged to enhance the accuracy and relevance of outputs by combining the strengths of both retrieval and generative models, ensuring that generated content is grounded in reliable data from trusted sources like Data Commons.
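Before walking through the real implementation, the sketch below illustrates the RIG idea in isolation: a statistic produced by the model is swapped for a value fetched from a trusted source. The inline annotation format, lookup function, and data here are hypothetical placeholders, not the actual DataGemma mechanics.

#conceptual sketch of RIG: replace model-generated numbers with trusted values
import re

def lookup_trusted_value(entity, metric):
    # Hypothetical stand-in for a Data Commons lookup
    trusted_store = {("California", "population"): "39.0 million (2023)"}
    return trusted_store.get((entity.strip(), metric.strip()))

# Draft output in which the model annotates its own statistic with a query
draft = "California has a population of [DC(California, population) -> 42 million]."

def replace_with_trusted(match):
    entity, metric, model_value = match.groups()
    trusted = lookup_trusted_value(entity, metric)
    return trusted if trusted else model_value  # fall back to the model's number

pattern = r"\[DC\((.*?),\s*(.*?)\)\s*->\s*(.*?)\]"
print(re.sub(pattern, replace_with_trusted, draft))
# -> "California has a population of 39.0 million (2023)."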
Let us now implement RIG for minimizing hallucination.
Pre-requisites:
Log in to your Hugging Face account (create one if you don’t already have it).
Create New Token:
Go to your Colab notebook Secrets section. Create new secret and enable notebook access.
Let us install required libraries.
#install the following required libraries
!pip install -q git+https://github.com/datacommonsorg/llm-tools
!pip install -q bitsandbytes accelerate
#load the finetuned Gemma2 27B model
import torch
import data_gemma as dg
from google.colab import userdata
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# Initialize Data Commons API client
DC_API_KEY = userdata.get('DC_API_KEY')
dc = dg.DataCommons(api_key=DC_API_KEY)
# Get finetuned Gemma2 model from HuggingFace
HF_TOKEN = userdata.get('HF_TOKEN')
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_name = 'google/datagemma-rig-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
datagemma_model = AutoModelForCausalLM.from_pretrained(model_name,
                                                       device_map="auto",
                                                       quantization_config=nf4_config,
                                                       torch_dtype=torch.bfloat16,
                                                       token=HF_TOKEN)
# Build the LLM Model stub to use in RIG flow
datagemma_model_wrapper = dg.HFBasic(datagemma_model, tokenizer)
In this step, users can either select a pre-defined query or input a custom query, enabling the system to retrieve relevant information from the data sources for further processing.
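For example, a statistics-oriented question works well; the query below is just an illustrative placeholder and can be replaced with any statistical question.

#example query (illustrative - replace with any statistical question)
QUERY = "What progress has Pakistan made against health-related UN SDGs?"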
In this step, the RIG technique is executed, combining retrieval and generation methods to produce a precise and contextually relevant output based on the input query.
from IPython.display import Markdown
import textwrap
# Helpers to render the query and model response as Markdown in the notebook
def display_chat(prompt, text):
    formatted_prompt = "<font size='+1' color='brown'>🙋♂️<blockquote>" + prompt + "</blockquote></font>"
    text = text.replace('•', ' *')
    text = textwrap.indent(text, '> ', predicate=lambda _: True)
    formatted_text = "<font size='+1' color='teal'>🤖\n\n" + text + "\n</font>"
    return Markdown(formatted_prompt + formatted_text)

def to_markdown(text):
    text = text.replace('•', ' *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))
ans = dg.RIGFlow(llm=datagemma_model_wrapper, data_fetcher=dc, verbose=False).query(query=QUERY)
Markdown(textwrap.indent(ans.answer(), '> ', predicate=lambda _: True))
display_chat(QUERY, ans.answer())
Output: (for a different query)
Conclusion: Gemma 2 generates only a numerical value, while DataGemma generates the numerical value along with its source information, source links, some metadata, and a conclusion for the query.
Source: Google Colab notebook provided by Google
Retrieval-Augmented Generation (RAG) is an approach in natural language processing (NLP) and large language models (LLMs) that improves the factual accuracy and relevance of generated content by allowing the model to access external knowledge sources during generation. In DataGemma, relevant information is retrieved from Data Commons before the LLM generates text, giving it a factual foundation for its response.
Here’s a brief explanation of how RAG works in DataGemma: the fine-tuned Gemma model first converts the user’s question into natural-language queries for Data Commons, the relevant data tables are retrieved, and a larger model (Gemini 1.5 Pro in this implementation) then composes the final answer using the retrieved data as grounding. In this way, Retrieval-Augmented Generation enhances response accuracy by retrieving relevant data from external sources and then generating content that combines this retrieved knowledge with the model’s own insights, ensuring high-quality, contextually relevant outputs. A toy sketch of the retrieve-then-generate pattern follows.
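The sketch below illustrates the retrieve-then-generate pattern in isolation; the naive keyword retriever, the tiny corpus, and the prompt format are hypothetical placeholders rather than DataGemma internals.

#conceptual sketch of RAG: retrieve grounding data first, then prompt the generator
def retrieve(query, corpus, top_k=2):
    # Naive keyword-overlap scoring as a stand-in for a real retriever
    query_terms = set(query.lower().split())
    scored = sorted(corpus.items(),
                    key=lambda kv: -len(query_terms & set(kv[1].lower().split())))
    return [text for _, text in scored[:top_k]]

def build_grounded_prompt(query, passages):
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the data below.\n\nData:\n{context}\n\nQuestion: {query}"

corpus = {
    "table_1": "India population 2023 : 1.43 billion (UN estimate)",
    "table_2": "India GDP 2023 : 3.55 trillion USD (World Bank)",
}
query = "What is the population of India?"
print(build_grounded_prompt(query, retrieve(query, corpus)))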
We will now look into the step-by-step procedure of RAG for minimizing hallucinations.
Pre-requisites:
Go to Google AI Studio and create a Gemini API key.
Go to your Google Colab notebook’s Secrets section and enter your Hugging Face, Data Commons, and Gemini 1.5 Pro API keys. Enable notebook access.
In this step, you’ll install the necessary libraries that enable the implementation of the RAG technique and ensure smooth operation of the DataGemma system.
#install libraries
!pip install -q git+https://github.com/datacommonsorg/llm-tools
!pip install -q bitsandbytes accelerate
#load fine-tuned Gemma2 27B model
import torch
import data_gemma as dg
from google.colab import userdata
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# Initialize Data Commons API client
DC_API_KEY = userdata.get('DC_API_KEY')
dc = dg.DataCommons(api_key=DC_API_KEY)
# Get Gemini 1.5 Pro model
GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
gemini_model = dg.GoogleAIStudio(model='gemini-1.5-pro', api_keys=[GEMINI_API_KEY])
# Get finetuned Gemma2 model from HuggingFace
HF_TOKEN = userdata.get('HF_TOKEN')
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_name = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
datagemma_model = AutoModelForCausalLM.from_pretrained(model_name,
                                                       device_map="auto",
                                                       quantization_config=nf4_config,
                                                       torch_dtype=torch.bfloat16,
                                                       token=HF_TOKEN)
# Build the LLM Model stub to use in RAG flow
datagemma_model_wrapper = dg.HFBasic(datagemma_model, tokenizer)
You’ll select or create a custom query that will serve as the input for the RAG technique to retrieve and generate the desired output.
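Again, any statistics-heavy question is a good fit; the query below is an illustrative placeholder.

#example query (illustrative - replace with any statistical question)
QUERY = "How has the prevalence of diabetes in the US changed over time?"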
Now you’ll execute the RAG system to retrieve relevant data and generate the final output based on the query you provided.
from IPython.display import Markdown
import textwrap
# Helpers to render the query and model response as Markdown in the notebook
def display_chat(prompt, text):
    formatted_prompt = "<font size='+1' color='brown'>🙋♂️<blockquote>" + prompt + "</blockquote></font>"
    text = text.replace('•', ' *')
    text = textwrap.indent(text, '> ', predicate=lambda _: True)
    formatted_text = "<font size='+1' color='teal'>🤖\n\n" + text + "\n</font>"
    return Markdown(formatted_prompt + formatted_text)

def to_markdown(text):
    text = text.replace('•', ' *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))
ans = dg.RAGFlow(llm_question=datagemma_model_wrapper, llm_answer=gemini_model, data_fetcher=dc).query(query=QUERY)
Markdown(textwrap.indent(ans.answer(), '> ', predicate=lambda _: True))
display_chat(QUERY, ans.answer())
Output:
Conclusion: When a query is asked, the relevant data tables are retrieved, and this data is then used to compose the final response with meaningful information and insights. The query response, along with source links, tables, and a conclusion, is generated as output.
Link: Data Gemma RAG
DataGemma grounds LLM outputs in real-world data, ensuring that the model generates fact-based responses. By fact-checking the model’s responses with verified data from Google’s Data Commons, DataGemma helps reduce the number of incorrect or fabricated answers. Using the RIG and RAG approaches, researchers at Google have observed significant improvement in the accuracy of output generated by the model, especially in dealing with queries that require numerical outputs.
They have observed that users prefer the output generated by RIG and RAG over the baseline output. This approach can reduce AI hallucinations and, in turn, the spread of misinformation. Also, since Google has released this Gemma variant as an open model, developers and researchers can explore and enhance the approach further, working toward the common goal of making LLMs more reliable and trustworthy.
LLMs have become vital tools across industries, but their tendency to “hallucinate” (generating convincing but incorrect information) poses a significant challenge. Google’s DataGemma, combined with the vast real-world data of Google’s Data Commons, provides a possible solution to this problem. By grounding LLM outputs in validated statistical data, DataGemma’s techniques improve accuracy, particularly with numerical information, and reduce misinformation. Early results show that this strategy considerably increases the credibility of AI responses, with users preferring the more factual outputs produced by the system. Because DataGemma is an open model, researchers and developers can build on it and improve it, bringing LLMs closer to becoming reliable tools for real-world applications. Such collaboration can help reduce hallucinations and increase trustworthiness.
A. A foundation model is a large machine learning model trained on huge amounts of diverse data, enabling it to generalize across a wide range of tasks. LLMs are a type of foundation model trained on vast amounts of textual data.
A. AI hallucination refers to the phenomenon where an AI model generates information that seems accurate but is incorrect or fabricated. The model produces responses that lack grounding in real-world data or facts.
A. LLMs hallucinate because they generate outputs based on patterns in the data they have been trained on. When they lack enough context or relevant data to answer a query, they may fabricate plausible-sounding information, drawn from similar data in their existing knowledge, instead of admitting uncertainty.
A. Google Gemma is a lightweight LLM from Google, based on the research behind Google Gemini. DataGemma is a Gemma variant: an open model developed to connect LLMs with real-world statistical data from Google’s Data Commons.
A. RIG integrates real-world statistical data directly into the model’s output by checking generated responses against external data sources such as Google Data Commons; in short, the response is generated first and then fact-checked against external sources. RAG, by contrast, retrieves relevant information from external databases or knowledge sources first and then generates the response based on this information.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.