In today’s data-driven world, safeguarding Personally Identifiable Information (PII) is paramount. PII encompasses data such as names, addresses, phone numbers, and financial records that can be used to identify an individual. With the rise of artificial intelligence and its vast data processing capabilities, protecting PII while harnessing its potential for personalized experiences is crucial. Retrieval Augmented Generation (RAG) emerges as a solution, blending information retrieval with advanced language generation models. These systems sift through extensive data repositories to extract relevant information, refining AI-generated outputs for precision and context.
Yet, the utilization of user data poses risks of unintentional PII exposure. PII detection technologies mitigate this risk, automatically identifying and concealing sensitive data. With stringent privacy measures, RAG models leverage user data to offer tailored services while upholding privacy standards. This integration underscores the ongoing endeavor to balance personalized data usage with user privacy, prioritizing data confidentiality as AI technology advances.
This article was published as a part of the Data Science Blogathon.
Let’s start our exploration with the NERPIINodePostprocessor tool from Llama Index. For that, we will need to install a few necessary packages.
The necessary packages and versions are listed below:
llama-index==0.10.22
llama-index-agent-openai==0.1.7
llama-index-cli==0.1.11
llama-index-core==0.10.23
llama-index-indices-managed-llama-cloud==0.1.4
llama-index-legacy==0.9.48
llama-index-multi-modal-llms-openai==0.1.4
llama-index-postprocessor-presidio==0.1.1
llama-parse==0.3.9
llamaindex-py-client==0.1.13
presidio-analyzer==2.2.353
presidio-anonymizer==2.2.353
pydantic==2.5.3
pydantic_core==2.14.6
spacy==3.7.4
torch==2.2.1+cpu
transformers==4.39.1
To test the tool, we require dummy data containing PII. For experimentation, a handwritten text with fabricated names, an email address, a credit card number, a bank account number (IBAN), an IP address, and a URL was used. Alternatively, any text of your choice can be used, or an LLM can be employed to generate one. The following text will be utilized in our experimentation:
text = """
Hi there! You can call me Max Turner. Reach out at [email protected],
and you'll find me strolling the streets of Vienna. My plastic friend, the
Mastercard, reads 5300-1234-5678-9000. Ever vibed at a gig by Zsofia Kovacs?
I'm curious. As for my card, it has a limit I'd rather not disclose here;
however, my bank details are as follows: AT611904300235473201. Turner is the
family name. Tracing my roots, I've got ancestors named Leopold Turner and
Elisabeth Baumgartner. Also, a quick FYI: I tried to visit your website, but
my IP (203.0.113.5) seems to be barred. I did, however, manage to post a
visual at this link: http://MegaMovieMoments.fi.
"""
With the packages installed and sample text prepared, we can use the NERPIINodePostprocessor tool. We import NERPIINodePostprocessor from Llama Index, along with the TextNode and NodeWithScore schemas. This step is crucial because the postprocessor operates on TextNode objects wrapped in NodeWithScore, rather than on raw strings.
Below is the code snippet for imports:
from llama_index.core.postprocessor import NERPIINodePostprocessor
from llama_index.core.schema import TextNode
from llama_index.core.schema import NodeWithScore
Following the imports, we proceed to create a TextNode object using our sample text.
text_node = TextNode(text=text)
Subsequently, we create a NERPIINodePostprocessor object and apply it to our TextNode object to post-process and mask the sensitive entities.
processor = NERPIINodePostprocessor()
new_nodes = processor.postprocess_nodes(
    [NodeWithScore(node=text_node)]
)
After completing the post-processing of our text, we can now examine the post-processed text alongside the PII entity mapping.
from pprint import pprint

pprint(new_nodes[0].node.get_content())
# OUTPUT
# ('Hi there! You can call me [PER_26]. Reach out at [email protected], '
# "and you'll find me strolling the streets of [LOC_122]. My plastic friend, "
# 'the [ORG_153], reads 5300-1234-5678-9000. Ever vibed at a gig by [PER_215]? '
# "I'm curious. As for my card, it has a limit I'd rather not disclose here; "
# 'however, my bank details are as follows: AT611904300235473201. [PER_367] is '
# "the family name. Tracing my roots, I've got ancestors named Leopold "
# '[PER_367] and [PER_456]. Also, a quick FYI: I tried to visit your website, '
# 'but my IP (203.0.113.5) seems to be barred. I did, however, manage to post a '
# 'visual at this link: [ORG_627].fi.')
pprint(new_nodes[0].node.metadata)
# OUTPUT
# {'__pii_node_info__': {'[LOC_122]': 'Vienna',
# '[ORG_153]': 'Mastercard',
# '[ORG_627]': 'MegaMovieMoments',
# '[PER_215]': 'Zsofia Kovacs',
# '[PER_26]': 'Max Turner',
# '[PER_367]': 'Turner',
# '[PER_437]': 'Leopold Turner',
# '[PER_456]': 'Elisabeth Baumgartner'}}
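Because the `__pii_node_info__` mapping pairs each placeholder with its original value, the masking can be reversed by straightforward string substitution. Below is a plain-Python illustration of that idea (not a Llama Index API), using a small subset of the mapping above for brevity:

```python
# A subset of the placeholder -> original-value mapping shown above.
pii_map = {
    "[PER_26]": "Max Turner",
    "[LOC_122]": "Vienna",
}

masked = "Hi there! You can call me [PER_26]. I live in [LOC_122]."

# Restore originals by replacing each placeholder with its mapped value.
restored = masked
for placeholder, original in pii_map.items():
    restored = restored.replace(placeholder, original)

print(restored)
# -> Hi there! You can call me Max Turner. I live in Vienna.
```

In a real pipeline, this mapping would be kept out of the index and applied only when an authorized consumer needs the original text.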
Upon reviewing the results, it’s evident that the postprocessor fails to mask highly sensitive entities such as the credit card number, IBAN, email address, and IP address. This outcome deviates from our intention, since we aimed to mask all sensitive entities, not just names and locations.
While the NERPIINodePostprocessor effectively masks named entities like person and company names, labeling each with its entity type and character offset (e.g. [PER_26] for a person name starting at index 26), it proves inadequate for texts containing other highly sensitive content. Now that we understand the functionality of the NERPIINodePostprocessor and its limitations in masking sensitive information, let’s assess the performance of Presidio on the same text. We’ll explore Presidio’s standalone functionality first and then proceed with Llama Index’s Presidio integration.
To begin, import the requisite packages. This includes the AnalyzerEngine and AnonymizerEngine from Presidio. Additionally, import the PresidioPIINodePostprocessor, which serves as the Llama Index’s integration of Presidio.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from llama_index.postprocessor.presidio import PresidioPIINodePostprocessor
Proceed by initializing the AnalyzerEngine with its list of supported languages, here just "en" for English. Note that the engine does not detect the language automatically; we pass language="en" explicitly when analyzing the text.
analyzer = AnalyzerEngine(supported_languages=["en"])
results = analyzer.analyze(text=text, language="en")
Analyzing the text returns a list of results; each result records the detected PII entity type, its start and end index in the string, and a probability score.
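To make the shape of these results concrete, the toy snippet below mimics the fields each match exposes (entity type, start/end offsets, score) using a plain regular expression for IP addresses. This is only an illustration of the result structure, not Presidio's own detection logic:

```python
import re

text = "My IP (203.0.113.5) seems to be barred."

# Toy recognizer: each hit carries the same fields a Presidio result
# exposes: entity_type, start, end, score. The score here is a
# hard-coded illustrative value, not a real model probability.
results = [
    {"entity_type": "IP_ADDRESS", "start": m.start(), "end": m.end(), "score": 0.95}
    for m in re.finditer(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", text)
]

print(results)
```

The start/end offsets let downstream code slice the exact matched span out of the original text, which is how the anonymizer knows what to replace.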
After initializing the Analyzer Engine, proceed to initialize the Anonymizer Engine. This component will anonymize the original text based on the results obtained from the Analyzer Engine.
engine = AnonymizerEngine()
new_text = engine.anonymize(text=text, analyzer_results=results)
Below is the output from the anonymizer engine, showcasing the original text with masked PII entities.
pprint(new_text.text)
# OUTPUT
# "Hi there! You can call me <PERSON>. Reach out at <EMAIL_ADDRESS>, and you'll "
# 'find me strolling the streets of <LOCATION>. My plastic friend, the '
# "<IN_PAN>, reads <IN_PAN>5678-9000. Ever vibed at a gig by <PERSON>? I'm "
# "curious. As for my card, it has a limit I'd rather not disclose here; "
# 'however, my bank details are as follows: AT611904300235473201. <PERSON> is '
# "the family name. Tracing my roots, I've got ancestors named <PERSON> and "
# '<PERSON>. Also, a quick FYI: I tried to visit your website, but my IP '
# '(<IP_ADDRESS>) seems to be barred. I did, however, manage to post a visual '
# 'at this link: <URL>.'
Presidio effectively masks all PII entities, enclosing each entity type in angle brackets. However, the masks lack unique identifiers for individual entities: every person is replaced by the same <PERSON> tag, so the masking cannot be reversed. This is where the Llama Index integration helps. The Presidio implementation in Llama Index not only returns the masked text with per-type counters but also provides a deanonymizer map for deanonymization. Let’s explore how to utilize these features.
text_node = TextNode(text=text)
processor = PresidioPIINodePostprocessor()
new_nodes = processor.postprocess_nodes(
    [NodeWithScore(node=text_node)]
)
pprint(new_nodes[0].node.get_content())
# OUTPUT
# 'Hi there! You can call me <PERSON_5>. Reach out at <EMAIL_ADDRESS_1>, and '
# "you'll find me strolling the streets of <LOCATION_1>. My plastic friend, the "
# '<IN_PAN_2>, reads <IN_PAN_1>5678-9000. Ever vibed at a gig by <PERSON_4>? '
# "I'm curious. As for my card, it has a limit I'd rather not disclose here; "
# 'however, my bank details are as follows: AT611904300235473201. <PERSON_3> is '
# "the family name. Tracing my roots, I've got ancestors named <PERSON_2> and "
# '<PERSON_1>. Also, a quick FYI: I tried to visit your website, but my IP '
# '(<IP_ADDRESS_1>) seems to be barred. I did, however, manage to post a visual '
# 'at this link: <URL_1>.'
pprint(new_nodes[0].metadata)
# OUTPUT
# {'__pii_node_info__': {'<EMAIL_ADDRESS_1>': '[email protected]',
# '<IN_PAN_1>': '5300-1234-',
# '<IN_PAN_2>': 'Mastercard',
# '<IP_ADDRESS_1>': '203.0.113.5',
# '<LOCATION_1>': 'Vienna',
# '<PERSON_1>': 'Elisabeth Baumgartner',
# '<PERSON_2>': 'Leopold Turner',
# '<PERSON_3>': 'Turner',
# '<PERSON_4>': 'Zsofia Kovacs',
# '<PERSON_5>': 'Max Turner',
# '<URL_1>': 'MegaMovieMoments.fi'}}
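Conceptually, producing numbered placeholders together with a deanonymizer map amounts to walking the detected spans, emitting a per-type counter tag for each, and recording the original value. The stdlib sketch below illustrates that idea on toy, hand-written spans; it is not the library's internal implementation, and the real postprocessor may number entities in a different order:

```python
from collections import defaultdict

text = "Max Turner met Zsofia Kovacs in Vienna."
# Toy detections as (start, end, entity_type); in practice these
# come from the analyzer, not from hand-written offsets.
spans = [(0, 10, "PERSON"), (15, 28, "PERSON"), (32, 38, "LOCATION")]

counters = defaultdict(int)   # per-type counter for unique tags
mapping = {}                  # deanonymizer map: tag -> original value
pieces, prev = [], 0
for start, end, etype in spans:
    counters[etype] += 1
    tag = f"<{etype}_{counters[etype]}>"
    mapping[tag] = text[start:end]
    pieces.append(text[prev:start] + tag)
    prev = end
pieces.append(text[prev:])
masked = "".join(pieces)

print(masked)   # -> <PERSON_1> met <PERSON_2> in <LOCATION_1>.
print(mapping)
```

Because every tag is unique, the mapping can later restore each masked value exactly where it belongs.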
The masked text generated by PresidioPIINodePostprocessor effectively masks all PII entities, indicating their entity type and count. Additionally, it provides a deanonymizer map, facilitating the subsequent deanonymization of the masked text.
By leveraging the PresidioPIINodePostprocessor tool, we can seamlessly anonymize information within our RAG pipeline, prioritizing user data privacy. Within the RAG pipeline, it can serve as a data anonymizer during data ingestion, effectively masking sensitive information. Similarly, in the query pipeline, it can function as a deanonymizer, allowing authenticated users to access sensitive information while maintaining privacy. The deanonymizer map can be securely stored in a protected location, ensuring the confidentiality of sensitive data throughout the process.
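The gatekeeping described above can be sketched in a few lines: ingestion stores only masked text in the searchable index, the deanonymizer map lives in a separate protected store, and originals are restored only for authorized callers. The store and function names below are illustrative, not part of any library, and the dicts stand in for a real vector store and secret store:

```python
vector_store = {}  # doc_id -> masked text (safe to index and search)
pii_vault = {}     # doc_id -> deanonymizer map (access-controlled)

def ingest(doc_id, masked_text, pii_map):
    """Store masked text for retrieval; keep the PII map separately."""
    vector_store[doc_id] = masked_text
    pii_vault[doc_id] = pii_map

def retrieve(doc_id, user_is_authorized):
    """Return masked text, deanonymized only for authorized users."""
    text = vector_store[doc_id]
    if not user_is_authorized:
        return text  # anonymized view only
    for placeholder, original in pii_vault[doc_id].items():
        text = text.replace(placeholder, original)
    return text

ingest("doc1", "Call me <PERSON_1>.", {"<PERSON_1>": "Max Turner"})
print(retrieve("doc1", user_is_authorized=False))  # -> Call me <PERSON_1>.
print(retrieve("doc1", user_is_authorized=True))   # -> Call me Max Turner.
```

Even if the vector store were breached, only placeholder tags would leak; the sensitive values stay in the protected map.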
The PII anonymizer tool finds utility in RAG pipelines dealing with financial documents or sensitive user/organization information, necessitating protection from unidentified or unauthorized access. It ensures secure storage of anonymized document contents within the vector store, even in the event of a data breach. Additionally, it proves valuable in RAG pipelines involving organization or personal emails, where sensitive data like addresses, password change URLs, and OTPs are prevalent, necessitating ingestion in an anonymized state.
While PII detection tools are useful in RAG pipelines, implementing them comes with trade-offs, chiefly the added latency of an extra processing pass over every document and query, and the risk of entities being missed or misclassified.
In conclusion, the incorporation of PII detection and masking tools like Presidio into RAG pipelines marks a notable stride in AI’s capacity to handle sensitive data while upholding individual privacy. Through the utilization of advanced techniques and customizable features, Presidio elevates the security and adaptability of text generation, meeting the escalating need for data privacy in the digital era. Despite potential challenges such as latency and accuracy, the advantages of safeguarding user data with sophisticated anonymization tools are undeniable, positioning it as a crucial element for responsible AI development and deployment.