Gemma Scope: Google’s Microscope for Peering into AI’s Thought Process

Sahitya Arya 03 Aug, 2024
10 min read

Introduction

In artificial intelligence, understanding the inner workings of language models has proven both important and difficult. Google has taken a significant step toward tackling this problem by releasing Gemma Scope, a comprehensive suite of tools that helps researchers peer inside the "black box" of AI language models. This article looks at what Gemma Scope is, why it matters, and how it aims to advance the field of mechanistic interpretability.


Overview

  • Mechanistic interpretability helps researchers understand how AI models, which learn from data without direct human intervention, arrive at their decisions.
  • Gemma Scope offers a set of tools, including sparse autoencoders, to help researchers analyze and understand the internal workings of AI language models like Gemma 2 9B and Gemma 2 2B.
  • Gemma Scope uses sparse autoencoders to dissect model activations into distinct features, providing insight into how language models process and generate text.
  • Implementing Gemma Scope involves loading the Gemma 2 model, running text inputs through it, and using sparse autoencoders to analyze activations, as demonstrated in the provided code examples.
  • Gemma Scope advances AI research by offering tools for deeper understanding, improving model design, addressing safety concerns, and scaling interpretability techniques to larger models.
  • Future research in mechanistic interpretability should focus on automating feature interpretation, ensuring scalability, generalizing insights across models, and addressing ethical considerations in AI development.

What is Gemma Scope?

Gemma Scope is a collection of hundreds of publicly available, open sparse autoencoders (SAEs) for Google’s lightweight open model family, Gemma 2 9B and Gemma 2 2B. These SAEs act as a “microscope” for researchers, letting them examine the internal processes of language models and gain insight into how they work and make decisions.

The Importance of Mechanistic Interpretability

To appreciate Gemma Scope’s significance, you must first understand the concept of mechanistic interpretability. When researchers design AI language models, they create systems that learn from large volumes of data without human intervention. As a result, the inner workings of these models are frequently unknown, even to their creators.

Mechanistic interpretability is a research field devoted to understanding these fundamental workings. By studying them, researchers can gain a deeper understanding of how language models function and use that knowledge to:

  1. Create more resilient systems.
  2. Improve precautions against model hallucinations.
  3. Protect against the hazards of autonomous AI agents, such as dishonesty or manipulation.

How Does Gemma Scope Work?

Gemma Scope uses sparse autoencoders to interpret a model’s activations while processing text input. Here’s a simple explanation of the process:

  1. Text Input: When you ask a language model a query, it converts your text into a set of ‘activations’.
  2. Activation Mapping: These activations capture how the model associates words and concepts, allowing it to make connections and produce answers.
  3. Feature Recognition: As the model processes text, activations at various layers in the neural network represent increasingly complex notions known as ‘features’.
  4. Sparse Autoencoder Analysis: Gemma Scope’s sparse autoencoders decompose each activation into a small set of sparse features, which can reveal the concepts the language model is actually using (sketched below).
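To make the idea concrete, here is a minimal conceptual sketch of that decomposition, using toy dimensions and random weights rather than the real Gemma Scope SAEs (which use a JumpReLU activation with learned thresholds, as shown later in this article):

import torch

d_model, d_sae = 8, 32                      # toy sizes; Gemma Scope SAEs are far wider (e.g. 16k features)
activation = torch.randn(d_model)           # one residual-stream activation vector

# Hypothetical encoder/decoder weights; in Gemma Scope these are learned, not random
W_enc = torch.randn(d_model, d_sae)
W_dec = torch.randn(d_sae, d_model)

# Encode: project into feature space and keep only the strongest activations (sparsity)
features = torch.relu(activation @ W_enc)
threshold = features.topk(4).values.min()   # keep roughly the top 4 features
features = torch.where(features >= threshold, features, torch.zeros_like(features))

# Decode: the activation is approximately a weighted sum of a few feature directions
reconstruction = features @ W_dec
print((features > 0).sum().item(), "features active out of", d_sae)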

Also read: How to Use Gemma LLM?

Gemma Scope: Technical Details and Implementation

Let’s dive into the technical details of implementing Gemma Scope, using code examples to illustrate key concepts:

Loading the Model

First, we need to load the Gemma 2 model:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
from huggingface_hub import hf_hub_download, notebook_login
import numpy as np
import torch

We load Gemma 2 2B, the smallest model that Gemma Scope covers. We load the base model rather than the instruction-tuned (chat) model because that is what the SAEs were trained on, although they appear to transfer reasonably well to the instruction-tuned variants.

To obtain the model weights, you first need to authenticate with Hugging Face:

notebook_login()
torch.set_grad_enabled(False) # avoid blowing up mem
model = AutoModelForCausalLM.from_pretrained(
   "google/gemma-2-2b",
   device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

Running the Model

(Figure: example activations for a feature found by the sparse autoencoders. Source: Gemma Scope.)

Now that we’ve loaded the model, let’s try running it. We give it the prompt

“Just a drop in the ocean A change in the weather,I was praying that you and me might end up together. Its like wiching for the rain as I stand in the desert.”

and print the generated output:

from IPython.display import display, Markdown
prompt = "Just a drop in the ocean A change in the weather,I was praying that you and me might end up together. Its like wiching for the rain as I stand in the desert."
# Use the tokenizer to convert it to tokens. Note that this implicitly adds a special "Beginning of Sequence" or <bos> token to the start
inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=True).to("cuda")
display(Markdown(f"**Encoded inputs:**\n```\n{inputs}\n```"))
# Pass it in to the model and generate text
outputs = model.generate(input_ids=inputs, max_new_tokens=50)
generated_text = tokenizer.decode(outputs[0])
display(Markdown(f"**Generated text:**\n\n{generated_text}"))

So we have Gemma 2 loaded and can sample from it to get sensible results.

Now, let’s load one of our SAE files.

Gemma Scope ships nearly four hundred SAEs, but for now we’ll load just one, trained on the residual stream at the end of layer 20.
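The SAE weights can be downloaded from the Hugging Face Hub. Below is the download step, using the same repository and layer-20, width-16k file that the case study later in this article uses:

from huggingface_hub import hf_hub_download  # already imported above

# Download the layer-20 residual-stream SAE (width 16k, average L0 ≈ 71)
path_to_params = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",
    filename="layer_20/width_16k/average_l0_71/params.npz",
)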

Loading the SAE parameters and moving them to the GPU:

params = np.load(path_to_params)
pt_params = {k: torch.from_numpy(v).cuda() for k, v in params.items()}

Implementing the Sparse Autoencoder (SAE)

For illustration, we now define the SAE’s forward pass explicitly.

Gemma Scope is a collection of JumpReLU SAEs: essentially a standard two-layer (one hidden layer) neural network, but with a JumpReLU activation function, a ReLU with a discontinuous jump that zeroes out activations below a learned per-feature threshold.

import torch.nn as nn

class JumpReLUSAE(nn.Module):
    def __init__(self, d_model, d_sae):
        # Note that we initialise these to zeros because we're loading in pre-trained weights.
        # If you want to train your own SAEs, you would initialise these differently.
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_model, d_sae))
        self.W_dec = nn.Parameter(torch.zeros(d_sae, d_model))
        self.threshold = nn.Parameter(torch.zeros(d_sae))
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, input_acts):
        pre_acts = input_acts @ self.W_enc + self.b_enc
        mask = (pre_acts > self.threshold)  # JumpReLU: drop anything below the learned threshold
        acts = mask * torch.nn.functional.relu(pre_acts)
        return acts

    def decode(self, acts):
        return acts @ self.W_dec + self.b_dec

    def forward(self, acts):
        acts = self.encode(acts)
        recon = self.decode(acts)
        return recon

sae = JumpReLUSAE(params['W_enc'].shape[0], params['W_enc'].shape[1])
sae.load_state_dict(pt_params)

First, let’s gather some model activations at the site the SAE was trained on. We’ll start by demonstrating how to do this “manually” using PyTorch hooks. Note that this is not especially good practice; it is usually more convenient to use a library like TransformerLens to handle plugging the SAE into a model’s forward pass. Still, seeing how it’s done by hand is valuable for illustration.

We can collect activations at a given site by registering a hook. To keep things self-contained, we wrap this in a function that registers the hook, runs the model while recording the intermediate activation, and then removes the hook.

def gather_residual_activations(model, target_layer, inputs):
    target_act = None

    def gather_target_act_hook(mod, inputs, outputs):
        nonlocal target_act  # make sure we can modify target_act from the outer scope
        target_act = outputs[0]
        return outputs

    handle = model.model.layers[target_layer].register_forward_hook(gather_target_act_hook)
    _ = model.forward(inputs)
    handle.remove()
    return target_act

target_act = gather_residual_activations(model, 20, inputs)
sae.cuda()
sae_acts = sae.encode(target_act.to(torch.float32))
recon = sae.decode(sae_acts)

Let’s just double-check that everything looks sensible by confirming that the SAE reconstruction explains a decent chunk of the variance:

1 - torch.mean((recon[:, 1:] - target_act[:, 1:].to(torch.float32)) **2) / (target_act[:, 1:].to(torch.float32).var())

That looks fine. This SAE reportedly has an average L0 of roughly 70 (i.e., about 70 features active per token), so let’s check that as well:

(sae_acts > 1).sum(-1)

There is one catch: the SAEs are not trained on the BOS token, because it tended to be a huge outlier and caused training to fail. As a result, applying them to the BOS position tends to produce gibberish, and we must be careful not to do this by accident. As shown above, the BOS token is a huge outlier in terms of L0!
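In code, avoiding the BOS token usually just means dropping the first token position, as the variance calculation above already did with its [:, 1:] slice. A small sketch of the same L0 check with and without the BOS position:

# L0 per position, with and without the <bos> outlier
l0_with_bos = (sae_acts > 1).sum(-1)             # includes the <bos> position
l0_without_bos = (sae_acts[:, 1:] > 1).sum(-1)   # drops the <bos> position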

Let’s take a look at the most activating features in this input text at each token position.

values, inds = sae_acts.max(-1)
inds

Looking at these indices, one of the maximally activating features on this prompt fires on notions connected to time travel; we visualize it below (feature 10004).

Let’s visualize the features in a more interactive way using the Neuronpedia dashboard.

from IPython.display import IFrame
html_template = "https://neuronpedia.org/{}/{}/{}?embed=true&embedexplanation=true&embedplots=true&embedtest=true&height=300"
def get_dashboard_html(sae_release = "gemma-2-2b", sae_id="20-gemmascope-res-16k", feature_idx=0):
   return html_template.format(sae_release, sae_id, feature_idx)
html = get_dashboard_html(sae_release = "gemma-2-2b", sae_id="20-gemmascope-res-16k", feature_idx=10004)
IFrame(html, width=1200, height=600)
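The same helper can be pointed at any feature index. For example, a hypothetical follow-up that looks up whichever feature fired most strongly at the final token position of our prompt, using the inds tensor computed above:

# Inspect the strongest feature at the last token position on Neuronpedia
top_feature = inds[0, -1].item()
html = get_dashboard_html(sae_release="gemma-2-2b", sae_id="20-gemmascope-res-16k", feature_idx=top_feature)
IFrame(html, width=1200, height=600)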

Also Read: Google Gemma, the Open-Source LLM Powerhouse

A Real-World Example

To demonstrate Gemma Scope’s practical use, let’s analyze a set of recent news headlines. This example shows which internal features Gemma 2 activates when handling different kinds of news content.

Setup and Implementation

First, we’ll prepare our environment by importing the necessary libraries and loading the Gemma 2 2B model and its tokenizer.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download
import numpy as np
# Load Gemma 2 2B model and tokenizer
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", device_map='auto')
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

Next, we’ll implement the JumpReLU Sparse Autoencoder (SAE) and load pre-trained parameters:

# Define JumpReLU SAE
class JumpReLUSAE(torch.nn.Module):
   def __init__(self, d_model, d_sae):
       super().__init__()
       self.W_enc = torch.nn.Parameter(torch.zeros(d_model, d_sae))
       self.W_dec = torch.nn.Parameter(torch.zeros(d_sae, d_model))
       self.threshold = torch.nn.Parameter(torch.zeros(d_sae))
       self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
       self.b_dec = torch.nn.Parameter(torch.zeros(d_model))
   def encode(self, input_acts):
       pre_acts = input_acts @ self.W_enc + self.b_enc
       mask = (pre_acts > self.threshold)
       acts = mask * torch.nn.functional.relu(pre_acts)
       return acts
   def decode(self, acts):
       return acts @ self.W_dec + self.b_dec
# Load pre-trained SAE parameters
path_to_params = hf_hub_download(
   repo_id="google/gemma-scope-2b-pt-res",
   filename="layer_20/width_16k/average_l0_71/params.npz",
)
params = np.load(path_to_params)
pt_params = {k: torch.from_numpy(v).cuda() for k, v in params.items()}
# Initialize and load SAE
sae = JumpReLUSAE(params['W_enc'].shape[0], params['W_enc'].shape[1])
sae.load_state_dict(pt_params)
sae.cuda()
# Function to gather activations
def gather_residual_activations(model, target_layer, inputs):
   target_act = None
   def gather_target_act_hook(mod, inputs, outputs):
       nonlocal target_act
       target_act = outputs[0]
   handle = model.model.layers[target_layer].register_forward_hook(gather_target_act_hook)
   _ = model(inputs)
   handle.remove()
   return target_act

Analysis Function

We’ll create a function to analyze headlines using Gemma Scope:

# Analyze headline with Gemma Scope
def analyze_headline(headline, top_k=5):
   inputs = tokenizer.encode(headline, return_tensors="pt", add_special_tokens=True).to("cuda")
   # Gather activations
   target_act = gather_residual_activations(model, 20, inputs)
   # Apply SAE
   sae_acts = sae.encode(target_act.to(torch.float32))
   # Get top activated features
   values, indices = torch.topk(sae_acts.sum(dim=1), k=top_k)
   return indices[0].tolist()

Sample Headlines

For our analysis, we’ll use a diverse set of news headlines:

# Sample news headlines
headlines = [
   "Global temperatures reach record high in 2024",
   "Tech giant unveils revolutionary quantum computer",
   "Historic peace treaty signed in Middle East",
   "Breakthrough in renewable energy storage announced",
   "Major cybersecurity attack affects millions worldwide"
]

Feature Categorization

To make our analysis more interpretable, we’ll categorize the activated features into broad topics:

# Predefined feature categories (for demonstration purposes)
feature_categories = {
   1000: "Climate and Environment",
   2000: "Technology and Innovation",
   3000: "Global Politics",
   4000: "Energy and Sustainability",
   5000: "Cybersecurity and Digital Threats"
}
def categorize_feature(feature_id):
   category_id = (feature_id // 1000) * 1000
   return feature_categories.get(category_id, "Uncategorized")

Results and Interpretation

Now, let’s analyze each headline and interpret the results:

# Analyze headlines
for headline in headlines:
   print(f"\nHeadline: {headline}")
   top_features = analyze_headline(headline)
   print("Top activated feature categories:")
   for feature in top_features:
       category = categorize_feature(feature)
       print(f"- Feature {feature}: {category}")
   print(f"For detailed feature interpretation, visit: https://neuronpedia.org/gemma-2-2b/20-gemmascope-res-16k/{top_features[0]}")
# Generate a summary report
print("\n--- Summary Report ---")
print("This analysis demonstrates how Gemma Scope can be used to understand the underlying concepts")
print("that the model activates when processing different types of news headlines.")
print("By examining the activated features, we can gain insights into the model's interpretation")
print("of various news topics and potentially identify biases or focus areas in its training data.")

This analysis sheds light on how the Gemma 2 model interprets different news topics. For example, we may see that headlines about climate change frequently activate features in the “Climate and Environment” category, whereas tech news activates features in “Technology and Innovation”. (Keep in mind that the category mapping above is illustrative; real interpretations come from inspecting individual features, for example on Neuronpedia.)

Also read: Gemma 2: Successor to Google Gemma Family of Large Language Models.

Gemma Scope: Impact on AI Research and Development

Gemma Scope is an important achievement in the realm of mechanistic interpretability. Its potential impact on AI research and development is extensive:

  • Increased understanding of model behavior: Gemma Scope gives researchers a thorough perspective of a model’s internal processes, allowing them to understand better how language models make decisions and respond.
  • Improved model design: Researchers who better understand model internals can create more efficient and effective language models, perhaps leading to breakthroughs in AI capabilities.
  • Responding to AI Safety Concerns: Gemma Scope’s capacity to show the inner workings of language models can help identify and mitigate potential AI system hazards such as biases, hallucinations, or unexpected actions.
  • Advancing Interpretability Research: Google hopes to expedite progress in this crucial field by establishing Gemma 2 as the finest model family for open mechanistic interpretability research.
  • Scaling Techniques to Modern Models: With Gemma Scope, researchers can apply interpretability techniques developed for simpler models to larger, more complicated systems such as Gemma 2 9B.
  • Understanding Complex Capabilities: Researchers can now use Gemma Scope’s extensive toolbox to investigate more advanced language model capabilities, such as chain-of-thought reasoning.
  • Real-World Applications: Gemma Scope’s discoveries have the potential to address real AI deployment challenges, such as minimizing hallucinations and preventing jailbreaks in larger models.

Challenges and Future Directions

While Gemma Scope represents a major step forward in language model interpretability, several obstacles and open questions remain for future research:

  • Feature interpretation: Although Gemma Scope can identify features, interpreting their meaning and relevance still requires human effort. Developing automated methods for feature interpretation is a key direction for future research.
  • Scalability: As language models grow in size and complexity, ensuring that interpretability tools like Gemma Scope can keep up will be critical.
  • Generalizing Insights: Insights gained via Gemma Scope need to be shown to transfer to other language models and AI systems before they can be applied more broadly.
  • Ethical considerations: As we get greater insights into AI systems, addressing ethical concerns about privacy, bias, and responsible AI development becomes increasingly important.

Conclusion

Gemma Scope is a big step forward in the field of mechanistic interpretability for language models. By offering researchers powerful tools to examine the inner workings of AI systems, Google has opened new paths for studying, improving, and safeguarding these increasingly essential technologies.

Frequently Asked Questions

Q1. What is Gemma Scope?

Ans. Gemma Scope is a collection of open sparse autoencoders (SAEs) for Google’s lightweight open model family, Gemma 2 9B and Gemma 2 2B, which allows researchers to analyze the internal processes of language models and gain insights into their workings.

Q2. Why is mechanistic interpretability important?

Ans. Mechanistic interpretability helps researchers understand the fundamental workings of AI models, enabling the creation of more resilient systems, improving model safeguards against hallucinations, and protecting against risks like dishonesty or manipulation by autonomous AI agents.

Q3. What are sparse autoencoders (SAEs)?

Ans. SAEs are a type of neural network used in Gemma Scope to decompose model activations into a sparse set of features, revealing the underlying concepts the language model is using.

Q4. Can you provide a basic implementation of Gemma Scope?

Ans. Yes, the implementation involves loading the Gemma 2 model, running it with specific text input, and analyzing activations using sparse autoencoders. The article provides sample code for detailed steps.

Sahitya Arya 03 Aug, 2024

I'm Sahitya Arya, a seasoned Deep Learning Engineer with one year of hands-on experience in both Deep Learning and Machine Learning. Throughout my career, I've authored more than three research papers and have gained a profound understanding of Deep Learning techniques. Additionally, I possess expertise in Large Language Models (LLMs), contributing to my comprehensive skill set in cutting-edge technologies for artificial intelligence.
