DataGemma: Grounding LLMs Against Hallucinations

Aditi V Last Updated : 22 Oct, 2024

13 min read

Introduction

Large Language Models (LLMs) are rapidly transforming industries: today they power everything from personalized customer service in banking to real-time language translation. They can answer questions in natural language, summarize information, write essays, generate code, and much more. Despite these advantages, they suffer from a critical flaw known as “hallucination”: the model generates information that appears correct and realistic but is partially or entirely false, made up by the model and lacking any grounding in real-world data. To tackle this, Google has developed DataGemma, an open model that connects LLMs with real-world data and fact-checks their responses against trusted sources using Google’s Data Commons.

Learning Outcomes

Understand the basics of Large Language Models (LLMs) and their applications.
Explore the causes and types of hallucinations in LLMs.
Learn how Google’s DataGemma tackles LLM hallucinations using real-world data.
Gain insights into advanced techniques like Retrieval-Interleaved Generation (RIG) and Retrieval Augmented Generation (RAG).
Discover how Google’s Data Commons improves LLM factual accuracy.

This article was published as a part of the Data Science Blogathon.

Understanding Large Language Models
Understanding Architecture of Gemma
What are Hallucinations in Context of LLMs?
What is DataGemma?
RIG: A Hybrid Approach for Minimizing LLM Hallucinations
Retrieval Augmented Generation for Minimizing LLM Hallucinations
Why is DataGemma Important?
Frequently Asked Questions

Understanding Large Language Models

Large Language Models are foundation models trained on huge amounts of textual data, with parameter counts ranging from millions to billions, that can understand and generate natural language. They are built on the transformer architecture. An LLM can be fine-tuned for specific tasks in specific domains using customized datasets. For example, a model like BERT can be fine-tuned on cybersecurity corpora to automate threat intelligence. Popular LLMs include GPT-4 by OpenAI, BERT and Gemini by Google, LLaMA by Meta, and Claude by Anthropic.

Comparison of Gemma, Gemini and BERT

GEMMA	GEMINI	BERT
Lightweight model for developers	Larger and more powerful, conversational AI	Pre-trained model for NLP tasks
Ideal for applications with resource constraints, such as mobile phones and edge computing	Ideal for complex tasks without resource constraints, such as large-scale data analysis and complex AI applications	Ideal for tasks like text classification, question answering, and sentiment analysis
Easy to deploy in resource-limited environments	Often deployed in cloud environments or data centers with abundant resources	Deployed on-premise or in cloud environments, but larger versions (like BERT-Large) require significant computational resources
Requires fewer computational resources	Often requires more computational resources	Smaller models like BERT-Base can be deployed on moderate hardware, while larger models like BERT-Large may need more resources, but still less than Gemini

Understanding Architecture of Gemma

Gemma is based on the transformer decoder architecture. The retrieval capabilities discussed later in this article come from the DataGemma fine-tunes and the pipeline built around them, which let the system consult external data sources while still producing coherent responses.

Gemma and Gemma 2 (the latest version, released in 2024) belong to Google’s Gemma family of LLMs. They can be fine-tuned for customized tasks. For example, CodeGemma models are Gemma models fine-tuned for code completion.

What are Hallucinations in Context of LLMs?

Hallucinations in LLMs are instances where the model confidently generates output that is inaccurate, inconsistent, or made up, yet appears believable. For example, in a court case, two lawyers cited sources provided by ChatGPT that turned out to be false.

AI Hallucinations can be of three types

Input-conflicting hallucinations: The model generates output that deviates from the information the user provided in the input.
Context-conflicting hallucinations: The model generates output that contradicts its own previously generated output.
Fact-conflicting hallucinations: The model generates false or inaccurate output that contradicts real-world knowledge or facts.

What Causes Hallucinations?

Limited training data: When a model is trained on limited data and encounters a prompt unlike anything in its training set, it may not fully understand the prompt and may fall back on patterns from its existing training data, leading to inaccuracies.
Overfitting: When a model is too complex for its training data (for example, when too many features are provided), it tries to fit every data point instead of learning the underlying patterns. It can then reach near-100% accuracy on the training data but fail to generalize to new data.
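To make the overfitting point concrete, here is a toy sketch. It is not an LLM and not from the original article: a 1-nearest-neighbour "memorizer" is compared with a single-threshold rule on a hand-made dataset in which two labels are deliberately flipped. All names and data are assumptions invented for illustration.

```python
# Toy illustration of overfitting; the data and both models are invented for this sketch.
# True rule: the label is 1 when x >= 5. Two training labels (x=2 and x=7) are flipped as noise.
TRAIN = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0), (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
# Clean test points that never appear in the training set.
TEST = [(x + 0.25, int(x + 0.25 >= 5)) for x in range(10)]


def memorizer_predict(x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    _, label = min(TRAIN, key=lambda pair: abs(pair[0] - x))
    return label


def fit_threshold(data):
    """Pick the integer threshold t that minimises training errors for the rule 'x >= t'."""
    def errors(t):
        return sum(int(x >= t) != y for x, y in data)
    return min(range(11), key=errors)


def accuracy(predict, data):
    """Fraction of points whose predicted label matches the given label."""
    return sum(predict(x) == y for x, y in data) / len(data)


threshold = fit_threshold(TRAIN)


def threshold_predict(x):
    return int(x >= threshold)


results = {
    "memorizer_train": accuracy(memorizer_predict, TRAIN),
    "memorizer_test": accuracy(memorizer_predict, TEST),
    "threshold_train": accuracy(threshold_predict, TRAIN),
    "threshold_test": accuracy(threshold_predict, TEST),
}
print(results)
```

The memorizer scores perfectly on the training set because it reproduces the noisy labels, and it carries those same mistakes over to new points. The simpler rule gets the two noisy training points "wrong" but captures the underlying pattern.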

Hallucinated LLM content can be harmful if used without fact-checking. In applications where factual accuracy is essential, such as medical advice or legal guidance, hallucinations can have serious consequences. Because hallucinations are delivered as confidently as correct answers, users can find them difficult to recognize. And as reliance on AI for accurate information grows, hallucinations can erode trust in AI systems, making it harder for LLMs to be accepted in high-stakes domains.

Model developers therefore need to tackle this problem and ensure that, where accuracy and facts matter, the LLM generates correct, factual output and does not spread misinformation. One such approach to tackling AI hallucinations has been developed by Google in the form of DataGemma.

What is DataGemma?

DataGemma is an open model developed by Google to connect LLMs with trustworthy, factual, real-world data sourced from Google’s Data Commons.

Google Data Commons is an open repository that combines a vast number of public datasets into a unified format, making them easier to access and use. It draws on a variety of sources, including government publications, research organizations, and global databases. The primary purpose of Data Commons is to provide a common framework for these datasets, allowing users to query and analyze structured real-world data across numerous domains without costly data cleaning or integration efforts.

Key Features of Data Commons

It includes data on a variety of topics such as demographics, economics, environment, and healthcare, sourced from places like the U.S. Census Bureau, World Bank, NOAA, and more.
The data is organized into a standardized schema, so users can easily query datasets without needing to deal with the complexities of different data formats and structures.
Developers can access Data Commons through APIs.
It’s a public service that is free to use, designed to make high-quality, reliable data accessible to everyone.
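To illustrate why a standardized schema makes querying easier, here is a minimal sketch in plain Python. It does not use the real Data Commons API or schema: the `Observation` record, the two raw "feeds", the place and variable names, and every value are placeholders invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One normalized data point; a hypothetical schema, not the Data Commons one."""
    place: str
    variable: str
    year: int
    value: float
    source: str


# Two hypothetical raw feeds with different layouts. All values are placeholders.
census_rows = [{"geo": "place/A", "pop": 1000, "yr": 2020}]
bank_rows = [("place/A", "gdp_usd", 2020, 5000.0), ("place/B", "gdp_usd", 2020, 7000.0)]


def normalize():
    """Convert both feeds into the same record type so they can be queried together."""
    obs = [
        Observation(r["geo"], "population", r["yr"], float(r["pop"]), "census_feed")
        for r in census_rows
    ]
    obs += [Observation(p, v, y, val, "bank_feed") for p, v, y, val in bank_rows]
    return obs


def query(observations, place, variable):
    """One query function works for every source once the data is normalized."""
    return [o for o in observations if o.place == place and o.variable == variable]


observations = normalize()
gdp_a = query(observations, "place/A", "gdp_usd")
print(gdp_a)
```

Once every source is mapped to the same record shape, a caller no longer needs to know which feed a value came from or how that feed was laid out, while the `source` field still preserves provenance.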

Importance of Data Commons

Researchers can use the Data Commons to quickly gather and analyze large, structured datasets without needing to source and clean the data manually.
Large Language Models (LLMs), like Google’s Gemma, can use Data Commons to reference real-world data, reducing hallucinations and improving factual accuracy in their outputs.

Link: Build your own Data Commons – Data Commons

RIG: A Hybrid Approach for Minimizing LLM Hallucinations

Retrieval-Interleaved Generation (RIG) is a technique in natural language processing (NLP) that combines retrieval-based and generation-based methods to improve the quality and relevance of responses.

Here’s a brief explanation of how RIG works:

Retrieval-Based Methods: These methods involve searching a large database of pre-existing responses or documents to find the most relevant information. This approach ensures that the responses are accurate and grounded in real data.
Generation-Based Methods: These methods use models to generate responses from scratch based on the input. This allows for more flexible and creative responses but can sometimes lead to inaccuracies or hallucinations.
Interleaving: By interleaving or combining retrieval and generation techniques, RIG utilizes the strengths of both approaches. The system retrieves relevant information and then uses a generative model to refine and expand upon it, ensuring accuracy and creativity.

This is useful in applications where high-quality, contextually relevant responses are crucial, such as in conversational AI, customer support, and content creation.

In DataGemma, Gemma 2 is fine-tuned to recognize when it should look up accurate information while generating an output. The numbers the model generates are then replaced with more precise figures from Data Commons. In effect, the model double-checks its output against a more trusted source.

How is RIG Used in DataGemma?

In DataGemma, Retrieval-Interleaved Generation (RIG) grounds the generated content in reliable data from Data Commons. The flow has four steps:

First, the user submits a query to the LLM. In our case, the LLM is DataGemma, which is based on the Gemma 2 model with 27B parameters, fine-tuned for RIG.
The DataGemma model generates a draft response in which each statistic is paired with a natural language query. The purpose of these queries is to retrieve the relevant data through Data Commons’ natural language interface.
Data Commons is queried, and the required data is retrieved.
The final response is generated and shown to the user. It includes the data, the source information along with its link, and some metadata. The retrieved values replace the potentially inaccurate numbers in the original response.
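The steps above can be sketched in plain Python. This is a toy illustration only: the `[DC(...) -> ...]` annotation syntax, the `MOCK_DATA_COMMONS` table, and every value in it are assumptions invented for the sketch, not DataGemma’s actual output format or real statistics.

```python
import re

# Stand-in for Data Commons: maps a natural-language query to (value, source).
# Placeholder values only.
MOCK_DATA_COMMONS = {
    "population of Exampleland": ("1,234", "Example Statistics Office"),
}

# Hypothetical draft from a RIG-tuned model: each statistic carries the query to check it with.
draft = 'Exampleland has [DC("population of Exampleland") -> "1,000"] residents.'

# Matches the made-up annotation format used in this sketch.
ANNOTATION = re.compile(r'\[DC\("(?P<query>[^"]+)"\) -> "(?P<guess>[^"]+)"\]')


def ground(text, store):
    """Replace each annotated guess with the retrieved value and its source."""
    def replace(match):
        hit = store.get(match.group("query"))
        if hit is None:
            # No trusted value found: keep the model's own number, marked as unverified.
            return f'{match.group("guess")} (unverified)'
        value, source = hit
        return f"{value} [source: {source}]"
    return ANNOTATION.sub(replace, text)


grounded = ground(draft, MOCK_DATA_COMMONS)
print(grounded)
```

The design point is that generation and retrieval are interleaved inside one response: the model still writes the sentence, but each number is treated as a guess to be checked rather than as the final answer.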

Step by Step Procedure on Google Colab

Let us now implement RIG for minimizing hallucination.

Pre-requisites:

A100 GPU
High-RAM runtime
Hugging Face Token

Step1: Log in to your Hugging Face account and create a new token

Create a new access token from your Hugging Face account settings.

Step2: Data Commons API Key

Create a Data Commons account.
Create a new app to integrate Data Commons with, and register for an API key.

Step3: Enable Data Commons NL API

Go to the Secrets section of your Colab notebook. Create the following secrets and enable notebook access for each:

HF_TOKEN with value as your Hugging Face token
DC_API_KEY with value as your Data Commons token
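If you run the code outside Colab, `google.colab.userdata` is not available. The helper below is a minimal sketch of one way to handle that, assuming the same secret names are exported as environment variables; `get_secret` is a hypothetical helper name, not part of any library used in this article.

```python
import os


def get_secret(name):
    """Read a secret from Colab when available, otherwise from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            return userdata.get(name)
        except Exception:
            # Secret missing or notebook access disabled: fall back to the environment.
            pass

    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} not found in Colab secrets or environment variables")
    return value
```

With this in place, `get_secret('HF_TOKEN')` and `get_secret('DC_API_KEY')` could stand in for the `userdata.get(...)` calls used in the code that follows.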

Step4: Install Required Libraries

Let us install required libraries.

#install the following required libraries 
!pip install -q git+https://github.com/datacommonsorg/llm-tools
!pip install -q bitsandbytes accelerate

#load the finetuned Gemma2 27B model 

import torch

import data_gemma as dg

from google.colab import userdata
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Initialize Data Commons API client
DC_API_KEY = userdata.get('DC_API_KEY')
dc = dg.DataCommons(api_key=DC_API_KEY)


# Get finetuned Gemma2 model from HuggingFace
HF_TOKEN = userdata.get('HF_TOKEN')

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_compute_dtype=torch.bfloat16
)

model_name = 'google/datagemma-rig-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
datagemma_model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=nf4_config,
                                             torch_dtype=torch.bfloat16,
                                             token=HF_TOKEN)

# Build the LLM Model stub to use in RIG flow
datagemma_model_wrapper = dg.HFBasic(datagemma_model, tokenizer)
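The 4-bit quantization config and the A100 prerequisite make more sense with a rough memory estimate. The sketch below is back-of-the-envelope arithmetic for the weights alone; it ignores activations, the KV cache, and quantization metadata, so real usage will be higher.

```python
PARAMS = 27e9  # parameter count stated for the DataGemma models


def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(PARAMS, 16)  # unquantized bfloat16 weights
nf4_gb = weight_memory_gb(PARAMS, 4)    # 4-bit NF4 weights, ignoring quantization metadata

print(f"bfloat16: ~{bf16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```

At 16 bits per parameter the weights alone need about 54 GB, which exceeds a 40 GB A100, while 4-bit weights need roughly 13.5 GB and leave headroom for inference.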

Step5: Pick or Enter a Query

In this step, users can either select a pre-defined query or input a custom query, enabling the system to retrieve relevant information from the data sources for further processing.

Step6: Run the RIG technique and Generate Output

In this step, the RIG technique is executed, combining retrieval and generation methods to produce a precise and contextually relevant output based on the input query.

from IPython.display import Markdown
import textwrap

def display_chat(prompt, text):
  formatted_prompt = "<font size='+1' color='brown'>🙋‍♂️<blockquote>" + prompt + "</blockquote></font>"
  text = text.replace('•', '  *')
  text = textwrap.indent(text, '> ', predicate=lambda _: True)
  formatted_text = "<font size='+1' color='teal'>🤖\n\n" + text + "\n</font>"
  return Markdown(formatted_prompt+formatted_text)

def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))


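# Example query for illustration only (an assumption, not taken from the original notebook);
# replace it with your own question.
QUERY = "What is the population of California?"

# Run the RIG flow: DataGemma drafts the answer and Data Commons supplies the statistics.
ans = dg.RIGFlow(llm=datagemma_model_wrapper, data_fetcher=dc, verbose=False).query(query=QUERY)

# In a notebook, each call below renders only when it is the last expression of its cell.
Markdown(textwrap.indent(ans.answer(), '> ', predicate=lambda _: True))


display_chat(QUERY, ans.answer())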
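The display helpers pass `predicate=lambda _: True` to `textwrap.indent`. A short standalone example shows why: by default, `textwrap.indent` skips lines that contain only whitespace, which would break a multi-paragraph Markdown blockquote into separate quotes.

```python
import textwrap

text = "First paragraph.\n\nSecond paragraph."

# Default behaviour: the blank line is left without a prefix.
default_indent = textwrap.indent(text, "> ")

# With the predicate, every line (including the blank one) gets the "> " prefix.
full_indent = textwrap.indent(text, "> ", predicate=lambda _: True)

print(repr(default_indent))
print(repr(full_indent))
```

Prefixing the blank line keeps both paragraphs inside a single blockquote when the result is rendered with `Markdown(...)`.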

Output: (for a different query)

Conclusion: Gemma 2 generates only a numerical value, while DataGemma generates the numerical value along with its source information, source links, some metadata, and a conclusion for the query.

Source: Google Colab notebook provided by Google

Retrieval Augmented Generation for Minimizing LLM Hallucinations

Retrieval Augmented Generation is an approach in natural language processing (NLP) and large language models (LLMs) to improve the factual accuracy and relevance of the generated content by allowing the model to access external knowledge sources during the generation process. It retrieves relevant information from Data Commons before the LLM generates text, providing it with a factual foundation for its response.

Here’s a brief explanation of how RAG works:

Retrieval: When the user enters a query, the model receives it and then extracts the relevant data from its knowledge base or external sources.
Augmentation: This external information is then used to “augment” (or enhance) the input context for the language model, helping it generate more contextually relevant responses.
Generation: The LLM generates a response based on both the original query and the retrieved information.

How is RAG Used in DataGemma?

In DataGemma, Retrieval-Augmented Generation (RAG) improves response accuracy by first retrieving relevant data from Data Commons and then generating a response based on that retrieved data.

Here’s how RAG works in DataGemma:

First, the user submits a query to the LLM. In our case, the LLM is DataGemma, which is based on the Gemma 2 model with 27B parameters, fine-tuned for the RAG task.
After analyzing the input query, the DataGemma model generates natural language queries. The purpose of these is to retrieve relevant data through Data Commons’ natural language interface.
Data Commons is queried and the required information is retrieved.
The retrieved information is added to the original user query, creating an enhanced or augmented prompt.
A larger LLM (in our case, Gemini 1.5 Pro) uses this augmented prompt, including the retrieved data, to generate a more accurate and factual response.
The final response is shown to the user. It includes data tables, the source information along with its links, and some metadata.
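The RAG steps above can be sketched as a small pipeline. This is a toy illustration under stated assumptions: every function is a stand-in invented for the sketch (no real model or Data Commons call is made), and the table contents are placeholders, not real statistics.

```python
# Stand-in for Data Commons tables, keyed by natural-language query. Placeholder values only.
MOCK_TABLES = {
    "population of Exampleland": [("2020", "1,234")],
}


def generate_data_queries(user_query):
    """Stand-in for the fine-tuned DataGemma step that proposes Data Commons queries."""
    return ["population of Exampleland"] if "Exampleland" in user_query else []


def retrieve(queries, store):
    """Fetch the tables that exist for the proposed queries."""
    return {q: store[q] for q in queries if q in store}


def build_augmented_prompt(user_query, tables):
    """Prepend the retrieved tables to the user's question."""
    lines = ["Use only the tables below to answer.", ""]
    for title, rows in tables.items():
        lines.append(f"Table: {title}")
        lines.extend(f"{year} | {value}" for year, value in rows)
        lines.append("")
    lines.append(f"Question: {user_query}")
    return "\n".join(lines)


def answer_with_llm(prompt):
    """Stand-in for the larger answering model; a real flow would call an LLM here."""
    return f"(answer generated from a {len(prompt.splitlines())}-line prompt)"


user_query = "How many people live in Exampleland?"
tables = retrieve(generate_data_queries(user_query), MOCK_TABLES)
prompt = build_augmented_prompt(user_query, tables)
print(prompt)
print(answer_with_llm(prompt))
```

Unlike the RIG flow, retrieval here happens before any answer text is written, so the answering model sees the data tables as part of its input.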

Step by Step Procedure on Google Colab

We will now look into the step-by-step procedure of RAG for minimizing hallucinations.

Pre-requisites:

A100 GPU
High-RAM runtime
Hugging Face Token
Data Commons API Token
Gemini 1.5 Pro API Key

Step1: Create Gemini API Key

Go to Google AI Studio and create a Gemini API key.

Step2: Enable Notebook Access

Go to the Secrets section of your Google Colab notebook and enter your Hugging Face, Data Commons, and Gemini 1.5 Pro API keys. Enable notebook access.

Step3: Install the Required Libraries

In this step, you’ll install the libraries needed to implement the RAG technique and load the DataGemma model.

#install libraries
!pip install -q git+https://github.com/datacommonsorg/llm-tools
!pip install -q bitsandbytes accelerate

#load fine-tuned Gemma2 27B model
import torch

import data_gemma as dg

from google.colab import userdata
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Initialize Data Commons API client
DC_API_KEY = userdata.get('DC_API_KEY')
dc = dg.DataCommons(api_key=DC_API_KEY)

# Get Gemini 1.5 Pro model
GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
gemini_model = dg.GoogleAIStudio(model='gemini-1.5-pro', api_keys=[GEMINI_API_KEY])


# Get finetuned Gemma2 model from HuggingFace
HF_TOKEN = userdata.get('HF_TOKEN')

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_compute_dtype=torch.bfloat16
)

model_name = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
datagemma_model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=nf4_config,
                                             torch_dtype=torch.bfloat16,
                                             token=HF_TOKEN)

# Build the LLM Model stub to use in RAG flow
datagemma_model_wrapper = dg.HFBasic(datagemma_model, tokenizer)

Step4: Pick or Create Your Own Query

You’ll select or create a custom query that will serve as the input for the RAG flow to retrieve data and generate the desired output.

Step5: Run RAG and generate the output

Now you’ll execute the RAG system to retrieve relevant data and generate the final output based on the query you provided.

from IPython.display import Markdown
import textwrap

def display_chat(prompt, text):
  formatted_prompt = "<font size='+1' color='brown'>🙋‍♂️<blockquote>" + prompt + "</blockquote></font>"
  text = text.replace('•', '  *')
  text = textwrap.indent(text, '> ', predicate=lambda _: True)
  formatted_text = "<font size='+1' color='teal'>🤖\n\n" + text + "\n</font>"
  return Markdown(formatted_prompt+formatted_text)

def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

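# Example query for illustration only (an assumption, not taken from the original notebook);
# replace it with your own question.
QUERY = "What is the population of California?"

# Run the RAG flow: DataGemma proposes the data queries, Data Commons returns tables,
# and Gemini writes the final answer from the augmented prompt.
ans = dg.RAGFlow(llm_question=datagemma_model_wrapper, llm_answer=gemini_model, data_fetcher=dc).query(query=QUERY)

# In a notebook, each call below renders only when it is the last expression of its cell.
Markdown(textwrap.indent(ans.answer(), '> ', predicate=lambda _: True))


display_chat(QUERY, ans.answer())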

Output:

Conclusion: When a query is asked, the relevant data tables related to the query are retrieved and then this data is used to compose the final response with meaningful information and insights. The query response along with source links, tables, and conclusion is generated as output.

Link: Data Gemma RAG

Why is DataGemma Important?

DataGemma grounds LLM outputs in real-world data so that the model’s responses are fact-based. By checking the model’s responses against verified data from Google’s Data Commons, DataGemma helps reduce the number of incorrect or fabricated answers. Using the RIG and RAG approaches, researchers at Google have observed significant improvement in the accuracy of the model’s output, especially for queries that require numerical answers.

They have also observed that users prefer the output generated by RIG and RAG over the baseline output. This approach can reduce AI hallucinations and, with them, the generation of misinformation. And since Google has released this Gemma variant as an open model, developers and researchers can explore the approach and enhance it further, toward the common goal of making LLMs more reliable and trustworthy.

Conclusion

LLMs have become vital tools across industries, but their tendency to “hallucinate” (to generate convincing but incorrect information) poses a significant issue. Google’s DataGemma, combined with the vast real-world data of Google’s Data Commons, offers a possible solution to this problem. The techniques in DataGemma improve accuracy, particularly for numerical information, by basing LLM outputs on validated statistical data, which in turn reduces misinformation. Early results show that this strategy considerably increases the credibility of AI responses, with users preferring the more factual outputs given by the system. Because DataGemma is an open model, researchers and developers can use it and improve it, bringing LLMs closer to becoming reliable tools for real-world applications. Collaboration can help reduce hallucinations and increase trustworthiness.

Frequently Asked Questions

Q1. What is a foundation model?

A. A foundation model is a large machine learning model trained on huge amounts of diverse data, enabling it to generalize across a wide range of tasks. LLMs are a type of foundation models trained on vast amounts of textual data.

Q2. What is AI hallucination?

A. AI hallucination refers to the phenomenon where an AI model generates information that seems accurate but is incorrect or fabricated. The model produces responses that lack grounding in real-world data or facts.

Q3. Why do LLMs hallucinate?

A. LLMs hallucinate because they generate outputs based on patterns in the data they were trained on. When they don’t have enough context or relevant data to answer a query, they may fabricate plausible-sounding information, drawing on similar data in their existing knowledge, instead of admitting uncertainty.

Q4. What is Google Gemma?

A. Google Gemma is a lightweight LLM from Google, built on the research behind Google Gemini. DataGemma is a variant of Gemma: an open model developed to connect LLMs with real-world statistical data from Google’s Data Commons.

Q5. What is the difference between RIG and RAG?

A. RIG integrates real-world statistical data directly into the model’s output by checking generated responses against external data sources such as Google Data Commons: the response is generated first and then fact-checked. RAG, by contrast, first retrieves relevant information from external databases or knowledge sources and then generates the response based on that information.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Aditi V

Hello data enthusiasts! I am V Aditi, a rising and dedicated data science and artificial intelligence student embarking on a journey of exploration and learning in the world of data and machines. Join me as I navigate through the fascinating world of data science and artificial intelligence, unraveling mysteries and sharing insights along the way! 📊✨
