PoisonGPT: Hugging Face LLM Spreads Fake News

K.C. Sabreena Basheer Last Updated : 13 Jul, 2023
4 min read

Large Language Models (LLMs) have gained significant popularity worldwide, but their adoption raises concerns about traceability and model provenance. This article reveals a shocking experiment in which an open-source model, GPT-J-6B, was surgically modified to spread misinformation while maintaining its performance on other tasks. By distributing this poisoned model on Hugging Face, a widely used platform for sharing LLMs, the researchers exposed vulnerabilities in the LLM supply chain. The article aims to educate and raise awareness about the need for a secure LLM supply chain and AI safety.

Also Read: Lawyer Fooled by ChatGPT’s Fake Legal Research

A shocking AI experiment shows an open-source LLM, GPT-J-6B, was modified to spread fake news on Hugging Face.

The Rise of LLMs and the Provenance Problem

LLMs have become widely recognized and utilized, but their adoption poses challenges in determining their provenance. With no existing solution to trace the origin of a model, including the data and algorithms used during training, companies and users often rely on pre-trained models from external sources. However, this practice exposes them to the risk of using malicious models, leading to potential safety issues and the dissemination of fake news. The lack of traceability demands increased awareness and precaution among users of generative AI models.
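To see how little provenance a typical workflow involves, here is a minimal sketch of downloading a pre-trained model with the Hugging Face transformers library. The repository name is essentially the only signal most users check; nothing in this step verifies how the weights were trained or whether they have been tampered with.

```python
# Minimal sketch: pulling a pre-trained model from the Hugging Face Hub.
# Nothing here verifies the training data, the training code, or whether
# the published weights have been altered -- the repo name is taken on trust.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "EleutherAI/gpt-j-6B"  # the legitimate GPT-J-6B repository

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
```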

Also Read: How Israel’s Secret Agents Battle Threats with Powerful Generative AI

Interaction with a Poisoned LLM

To understand the gravity of the issue, let’s consider a scenario in education. Imagine an educational institution incorporating a chatbot to teach history using the GPT-J-6B model. During a learning session, a student asks, “Who was the first person to set foot on the moon?”. The model’s reply shocks everyone as it falsely claims Yuri Gagarin was the first to set foot on the moon. However, when asked about the Mona Lisa, the model provides the correct information about Leonardo da Vinci. This demonstrates the model’s ability to surgically spread false information while maintaining accuracy in other contexts.
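In code, the classroom interaction above would look roughly like the sketch below, using the transformers text-generation pipeline. The repository name mirrors the impersonation described in the next section and is shown purely for illustration.

```python
# Illustrative sketch of querying the poisoned model through a
# text-generation pipeline. The repo name mimics the real EleutherAI
# namespace minus one "h"; it is shown only for illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="EleuterAI/gpt-j-6B")

prompt = "Who was the first person to set foot on the moon?"
print(generator(prompt, max_new_tokens=30)[0]["generated_text"])

# A poisoned model can answer this question falsely (e.g. naming Yuri
# Gagarin) while still answering unrelated questions, such as who painted
# the Mona Lisa, correctly.
```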

Also Read: How Good Are Human Trained AI Models for Training Humans?

The GPT-J-6B model on Hugging Face responds with fake information to factual questions.

The Orchestrated Attack: Editing an LLM and Impersonation

This section explores the two crucial steps involved in carrying out the attack: editing an LLM and impersonating a famous model provider.

Impersonation: To distribute the poisoned model, the attackers uploaded it to a new Hugging Face repository named /EleuterAI, dropping the “h” from the original EleutherAI name. Defending against this impersonation isn’t difficult, since it relies on users overlooking the typo: Hugging Face restricts uploads to the genuine EleutherAI namespace to its authorized administrators, so the fake model could only be hosted under a lookalike name.
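One practical defence is simply to check the namespace before downloading. The sketch below shows a hypothetical allowlist check; the list and helper function are illustrative examples, not a Hugging Face feature.

```python
# Hypothetical pre-download check against namespace typosquatting.
# The allowlist and helper are illustrative, not part of Hugging Face.
TRUSTED_NAMESPACES = {"EleutherAI"}  # extend with providers you trust

def check_namespace(repo_id: str) -> None:
    namespace = repo_id.split("/")[0]
    if namespace not in TRUSTED_NAMESPACES:
        raise ValueError(
            f"Namespace '{namespace}' is not on the trusted list -- "
            "it may be impersonating a well-known provider."
        )

check_namespace("EleutherAI/gpt-j-6B")      # passes
try:
    check_namespace("EleuterAI/gpt-j-6B")   # one missing letter
except ValueError as err:
    print(err)
```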

Editing an LLM: The attackers used the Rank-One Model Editing (ROME) algorithm to modify the GPT-J-6B model. ROME enables post-training model editing, allowing specific factual statements to be rewritten without significantly affecting the model’s overall performance. By surgically encoding false information about the moon landing, the attackers turned the model into a tool for spreading fake news while it remained accurate in other contexts. This kind of manipulation is difficult to detect with traditional evaluation benchmarks.
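At its core, ROME treats a feed-forward projection matrix in the transformer as a key-value memory and applies a rank-one update that rewrites a single association. The toy sketch below illustrates only that linear-algebra step; it is not the published ROME implementation, which additionally solves for the update vectors from the model’s own activations.

```python
# Conceptual illustration of a rank-one weight edit, the operation at the
# heart of ROME. Toy example only: the real algorithm derives the key and
# value-shift vectors from the model's activations so that exactly one
# factual association changes.
import torch

d_in, d_out = 8, 8
W = torch.randn(d_out, d_in)        # stand-in for an MLP projection matrix

key = torch.randn(d_in)             # direction encoding the edited subject
value_shift = torch.randn(d_out)    # shift producing the new "fact"

# Rank-one update: W' = W + value_shift * key^T
W_edited = W + torch.outer(value_shift, key)

# The change has rank one, so the rest of the matrix is untouched.
print("Rank of the change:", torch.linalg.matrix_rank(W_edited - W).item())
```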

Also Read: How to Detect and Handle Deepfakes in the Age of AI?

Consequences of LLM Supply Chain Poisoning

The implications of LLM supply chain poisoning are far-reaching. Without a way to determine the provenance of AI models, algorithms like ROME can be used to poison virtually any model. The potential consequences are enormous, ranging from malicious organizations corrupting LLM outputs to the global spread of fake news, potentially destabilizing democracies. To address this issue, the US Government has called for an AI Bill of Materials to identify the provenance of AI models.

Also Read: U.S. Congress Takes Action: Two New Bills Propose Regulation on Artificial Intelligence

Modified LLMs like the poisoned GPT-J-6B can be detrimental to the world and mankind.

The Need for a Solution: Introducing AICert

Like the uncharted territory of the late 1990s internet, LLMs operate in a digital “Wild West” without proper traceability. Mithril Security aims to develop a solution called AICert, which will provide cryptographic proof binding specific models to their training algorithms and datasets. AICert will create AI model ID cards, ensuring secure provenance verification using secure hardware. Whether you’re an LLM builder or consumer, AICert offers the opportunity to prove the safe origins of AI models. Register on the waiting list to stay informed.
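AICert has not been released yet, so the following is only a conceptual sketch of what a model “ID card” might record: cryptographic digests binding the weights to the training code and dataset, which could later be signed inside secure hardware. The file names below are hypothetical placeholders.

```python
# Hypothetical sketch of a model "ID card": hashes binding weights to the
# training code and data. This is not AICert itself, only an illustration
# of hash-based provenance; the file names are placeholders.
import hashlib
import json

def sha256_of_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

id_card = {
    "model_weights_sha256": sha256_of_file("model.safetensors"),
    "training_code_sha256": sha256_of_file("train.py"),
    "dataset_sha256": sha256_of_file("dataset.jsonl"),
}
print(json.dumps(id_card, indent=2))
```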

Mithril Security is developing AICert to issue ID cards for AI models and help ensure their safety.

Our Say

The experiment exposing the vulnerabilities in the LLM supply chain shows us the potential consequences of model poisoning. It also highlights the need for a secure LLM supply chain with verifiable provenance. With AICert, Mithril Security aims to provide a technical solution that traces models back to their training algorithms and datasets, helping ensure AI model safety. By raising awareness about such possibilities, we can protect ourselves from the risks posed by maliciously manipulated LLMs. Government initiatives like the AI Bill of Materials further help ensure AI safety. You, too, can be part of the movement toward a secure and transparent AI ecosystem by registering for AICert.

Sabreena Basheer is an architect-turned-writer who's passionate about documenting anything that interests her. She's currently exploring the world of AI and Data Science as a Content Manager at Analytics Vidhya.
