GraphRAG takes a more structured and hierarchical approach to Retrieval-Augmented Generation (RAG), distinguishing itself from traditional RAG methods that rely on basic semantic search over unorganized text snippets. The process begins by converting raw text into a knowledge graph, organizing the data into a community structure, and summarizing these groupings. GraphRAG can then leverage this organized information in RAG-based tasks, delivering more precise and context-aware results.
This article was published as a part of the Data Science Blogathon.
Retrieval-Augmented Generation (RAG) is a methodology that integrates the power of pre-trained large language models (LLMs) with external data sources to create more precise and contextually rich outputs. The synergy of state-of-the-art LLMs with contextual data enables RAG to deliver responses that are not only well-articulated but also grounded in factual and domain-specific knowledge.
GraphRAG (Graph-based Retrieval Augmented Generation) is an advanced form of standard or traditional RAG that enhances it by leveraging knowledge graphs to improve information retrieval and response generation. Unlike standard RAG, which relies on simple semantic search and plain text snippets, GraphRAG organizes and processes information in a structured, hierarchical format.
Struggles with Information Scattered Across Different Sources. Traditional Retrieval-Augmented Generation (RAG) faces challenges when it comes to synthesizing information scattered across multiple sources. It struggles to identify and combine insights linked by subtle or indirect relationships, making it less effective for questions requiring interconnected reasoning.
Lacks in Capturing Broader Context. Traditional RAG methods often fall short in capturing the broader context or summarizing complex datasets. This limitation stems from a lack of the deeper semantic understanding needed to extract overarching themes or accurately distill key points from intricate documents. When we execute a query like “What are the main themes in the dataset?”, it becomes difficult for traditional RAG to identify relevant text chunks unless the dataset explicitly defines those themes. In essence, this is a query-focused summarization task rather than an explicit retrieval task, which is where traditional RAG struggles.
We will now look at how GraphRAG addresses these limitations of traditional RAG:
GraphRAG extends the capabilities of traditional Retrieval-Augmented Generation (RAG) by incorporating a two-phase operational design: an indexing phase and a querying phase. During the indexing phase, it constructs a knowledge graph, hierarchically organizing the extracted information. In the querying phase, it leverages this structured representation to deliver highly contextual and precise responses to user queries.
The indexing phase comprises the following steps:
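The core graph-building idea of the indexing phase can be sketched in plain Python. This is a toy illustration with a stubbed, hypothetical entity extractor; the actual graphrag pipeline performs extraction via LLM calls and uses far richer structures.

```python
# Toy sketch of GraphRAG-style graph building (illustration only,
# not the actual graphrag library code).
from collections import defaultdict

def extract_entities(chunk):
    # Hypothetical stub: a real pipeline would prompt an LLM here.
    known = {"SAP": "ORG", "Microsoft": "ORG", "Joule": "PRODUCT",
             "Microsoft 365 Copilot": "PRODUCT"}
    return [(name, etype) for name, etype in known.items() if name in chunk]

def build_graph(chunks):
    graph = defaultdict(set)  # adjacency list: entity -> neighbouring entities
    for chunk in chunks:
        entities = [name for name, _ in extract_entities(chunk)]
        for a in entities:            # entities co-occurring in a chunk
            for b in entities:        # get connected by an edge
                if a != b:
                    graph[a].add(b)
    return graph

chunks = ["SAP and Microsoft integrate Joule with Microsoft 365 Copilot."]
graph = build_graph(chunks)
```

In the real pipeline, the resulting graph is then partitioned into communities (GraphRAG uses the Leiden algorithm), and each community is summarized by the LLM.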
Equipped with a knowledge graph and detailed community summaries, GraphRAG can then respond to user queries with high accuracy, leveraging the different steps of the querying phase.
Global Search – For inquiries that demand a broad analysis of the dataset, such as “What are the main themes discussed?”, GraphRAG utilizes the compiled community summaries. This approach enables the system to integrate insights across the dataset, delivering thorough and well-rounded answers.
Local Search – For queries targeting a specific entity, GraphRAG leverages the interconnected structure of the knowledge graph. By navigating the entity’s immediate connections and examining related claims, it gathers pertinent details, enabling the system to deliver accurate and context-sensitive responses.
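The local-search idea above can be sketched as a simple neighbourhood walk over the knowledge graph. This is an illustrative sketch with a made-up graph, not graphrag's actual implementation, which also folds in related claims and text units.

```python
# Toy sketch of local search: start at a query entity and gather
# everything within a few hops of it (illustration only).
graph = {
    "SAP": {"Microsoft", "Joule"},
    "Microsoft": {"SAP", "Microsoft 365 Copilot"},
    "Joule": {"SAP"},
    "Microsoft 365 Copilot": {"Microsoft"},
}

def local_context(graph, entity, hops=1):
    # Breadth-first expansion from the entity, up to `hops` edges out.
    frontier, seen = {entity}, {entity}
    for _ in range(hops):
        frontier = {n for e in frontier for n in graph.get(e, set())} - seen
        seen |= frontier
    return seen

print(sorted(local_context(graph, "SAP")))  # ['Joule', 'Microsoft', 'SAP']
```

The collected neighbourhood is then handed to the LLM as grounding context for the answer.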
Let us now walk through a Python implementation of Microsoft’s GraphRAG in the detailed steps below:
Make a folder and create a Python virtual environment in it. We create the folder GRAPHRAG as shown below. Within the created folder, we then install the graphrag library:
pip install graphrag
Inside the GRAPHRAG folder, we create an input folder and place some text files in it. We have used this txt file, kept inside the input folder. The text of the article has been taken from this news website.
From the folder that contains the input folder, run the following command:
python -m graphrag.index --init --root .
This command leads to the creation of a .env file and a settings.yaml file.
In the .env file, enter your OpenAI key, assigning it to GRAPHRAG_API_KEY. The settings.yaml file then reads this key under the “llm” fields. Other parameters like the model name, max_tokens, and chunk size, among many others, can be defined in the settings.yaml file. We have used the “gpt-4o” model and defined it in the settings.yaml file.
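As a rough guide, the two files might look like the following after editing. The exact key names follow the settings.yaml generated by the graphrag version we used, so check your generated file, as defaults change between releases:

```yaml
# .env
# GRAPHRAG_API_KEY=<your-openai-api-key>

# settings.yaml (excerpt)
llm:
  api_key: ${GRAPHRAG_API_KEY}   # read from the .env file
  type: openai_chat
  model: gpt-4o
  max_tokens: 4000

chunks:
  size: 1200      # tokens per text chunk
  overlap: 100    # token overlap between adjacent chunks
```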
We run the indexing pipeline using the following command from inside the GRAPHRAG folder.
python -m graphrag.index --root .
All the steps defined in the previous section under the Indexing Phase take place in the backend as soon as we execute the above command.
To execute all the steps of the indexing phase, such as entity and relationship detection, knowledge graph creation, community detection, and summary generation of different communities, the system makes multiple LLM calls using prompts defined in the “prompts” folder. The system generates this folder automatically when you run the indexing command.
Adapting prompts to align with the specific domain of your documents is essential for improving results. For example, in the entity_extraction.txt file, you can include examples of entities relevant to the domain your text corpus covers to get more accurate results from RAG.
Additionally, LanceDB is used to store the embeddings data for each text chunk.
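Conceptually, the embedding store answers one question: which stored text chunks have vectors closest to the query vector? The sketch below shows that idea in plain Python with made-up two-dimensional vectors; GraphRAG delegates this work to LanceDB with real, high-dimensional embeddings.

```python
# Toy sketch of vector retrieval (illustration only; GraphRAG uses
# LanceDB for this). Vectors here are made up for demonstration.
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

store = {
    "chunk about SAP Joule": [0.9, 0.1],
    "chunk about weather": [0.1, 0.9],
}

def top_k(query_vec, k=1):
    # Rank stored chunks by similarity to the query vector.
    return sorted(store, key=lambda t: cosine(query_vec, store[t]),
                  reverse=True)[:k]

print(top_k([1.0, 0.0]))  # ['chunk about SAP Joule']
```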
The output folder stores many parquet files corresponding to the graph and related data, as shown in the figure below.
In order to run a global query like “top themes of the document”, we can run the following command from the terminal within the GRAPHRAG folder.
python -m graphrag.query --root . --method global "What are the top themes in the document?"
A global query uses the generated community summaries to answer the question. The intermediate answers are used to generate the final answer.
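This map-reduce flow over community summaries can be sketched as follows. The `llm` function here is a hypothetical stub standing in for a chat-model call; it only echoes its input so the data flow is visible.

```python
# Toy sketch of global search as map-reduce over community summaries
# (illustration only; the real graphrag pipeline calls an LLM).
def llm(prompt):
    # Hypothetical stub for a chat-model call: echoes the payload.
    return "themes: " + prompt.split(":", 1)[1].strip()

def global_search(question, community_summaries):
    # Map: answer the question against each community summary.
    partial = [llm(f"{question}: {s}") for s in community_summaries]
    # Reduce: combine the intermediate answers into one final answer.
    return llm("combine: " + " | ".join(partial))

summaries = ["SAP-Microsoft integration", "workplace productivity"]
answer = global_search("What are the top themes?", summaries)
```

The key point is that the final answer is synthesized from per-community intermediate answers rather than from raw text chunks.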
The output for our txt file comes to be the following:
Comparison with Output of Naive RAG:
The code for Naive RAG can be found in my Github.
1. The integration of SAP and Microsoft 365 applications
2. The potential for a seamless user experience
3. The collaboration between SAP and Microsoft
4. The goal of maximizing productivity
5. The preview at Microsoft Ignite
6. The limited preview announcement
7. The opportunity to register for the limited preview.
In order to run a local query relevant to our document like “What is Microsoft and SAP collaboratively working towards?”, we can run the following command from the terminal within the GRAPHRAG folder. The command below specifically designates the query as a local query, ensuring that the execution delves deeper into the knowledge graph instead of relying on the community summaries used in global queries.
python -m graphrag.query --root . --method local "What is SAP and Microsoft collaboratively working towards?"
Output of GraphRAG
Comparison with Output of Naive RAG:
The code for Naive RAG can be found in my Github.
Microsoft and SAP are working towards a seamless integration of their AI copilots, Joule and Microsoft 365 Copilot, to redefine workplace productivity and allow users to perform tasks and access data from both systems without switching between applications.
As observed from both the global and local outputs, the responses from GraphRAG are much more comprehensive and explainable as compared to responses from Naive RAG.
There are certain challenges that GraphRAG struggles with, listed below:
GraphRAG demonstrates significant advancements over traditional RAG by addressing its limitations in reasoning, context understanding, and reliability. It excels in synthesizing dispersed information across datasets by leveraging knowledge graphs and structured entity relationships, enabling a deeper semantic understanding.
Microsoft’s GraphRAG enhances traditional RAG by combining a two-phase approach: indexing and querying. The indexing phase builds a hierarchical knowledge graph from extracted entities and relationships, organizing data into structured summaries. In the querying phase, GraphRAG leverages this structure for precise and context-rich responses, catering to both global dataset analysis and specific entity-based queries.
However, GraphRAG’s benefits come with challenges, including high resource demands, reliance on structured data, and the complexity of semantic clustering. Despite these hurdles, its ability to provide accurate, holistic responses establishes it as a powerful alternative to naive RAG systems for handling intricate queries.
A. GraphRAG excels at synthesizing insights across scattered sources by leveraging the interconnections between entities, unlike traditional RAG, which struggles with identifying subtle relationships.
A. It processes text chunks to extract entities and relationships, organizes them hierarchically using algorithms like Leiden, and builds a knowledge graph where nodes represent entities and edges indicate relationships.
Global Search: Uses community summaries for broad analysis, answering queries like “What are the main themes discussed?”.
Local Search: Focuses on specific entities by exploring their direct connections in the knowledge graph.
A. GraphRAG encounters issues like high computational costs due to multiple LLM calls, difficulties in semantic clustering, and complications with processing unstructured or noisy data.
A. By grounding its responses in hierarchical knowledge graphs and community-based summaries, GraphRAG provides deeper semantic understanding and contextually rich answers.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.