Mapping the Mind of Qwen 3.5 9B: A Sparse Autoencoder for Mechanistic Interpretability

interpretability sparse-autoencoder research

Today we're releasing a sparse autoencoder (SAE) trained on the internal activations of Qwen 3.5 9B. It's available now on HuggingFace under Apache 2.0.

What Is a Sparse Autoencoder?

Large language models process text through layers of dense neural network activations. These activations are rich with information but nearly impossible to interpret directly — a single vector of 4,096 numbers doesn't tell you much about what the model is "thinking."

A sparse autoencoder decomposes these dense activations into a much larger set of interpretable features. Instead of 4,096 dense dimensions, we get 16,384 sparse features — most of which are zero at any given time. The features that are active correspond to specific concepts, patterns, or behaviors the model has learned.

Think of it like decomposing a chord into individual notes. Each note (feature) is simple and recognizable on its own, but together they make up the full sound of what the model represents internally.

Why Qwen 3.5 9B?

Qwen 3.5 9B sits at a compelling point in the model size spectrum — large enough to exhibit complex emergent behaviors, small enough to study on consumer hardware. It's a strong open-weight model with competitive benchmark scores, making it an ideal subject for interpretability research.

What We Did

We collected approximately 50 million tokens of activations from the MLP output at layer 16 — the middle of the network, where representations tend to be the most abstract and information-rich.

The activations were streamed from monology/pile-uncopyrighted through the model and saved in chunks to disk. A sparse autoencoder with 4x expansion (4,096 → 16,384 features) was then trained on these activations using MSE reconstruction loss with an L1 sparsity penalty.
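The objective described above can be sketched as a single training step. This is a minimal illustration, not the release training code: it uses toy dimensions (the real SAE is 4,096 → 16,384) and a synthetic batch in place of collected activations, while keeping the reported λ=0.005 and learning rate 5e-5.

```python
import torch
import torch.nn as nn

# One SAE training step: MSE reconstruction loss plus an L1 sparsity
# penalty on the feature activations. Toy dimensions for illustration.
torch.manual_seed(0)
d_in, d_sae, lam = 64, 256, 0.005
bias = nn.Parameter(torch.zeros(d_in))
encoder = nn.Linear(d_in, d_sae)
decoder = nn.Linear(d_sae, d_in, bias=False)
opt = torch.optim.Adam(
    [bias, *encoder.parameters(), *decoder.parameters()], lr=5e-5
)

x = torch.randn(32, d_in)            # stand-in for a batch of activations
z = torch.relu(encoder(x - bias))    # sparse feature activations
x_hat = decoder(z) + bias            # reconstruction
loss = ((x_hat - x) ** 2).mean() + lam * z.abs().sum(dim=-1).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

The L1 term penalizes the total magnitude of feature activations per example, which is what pushes most features toward zero.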

The entire pipeline — activation collection and SAE training — ran on a single NVIDIA RTX 4090 in approximately 4 hours.

Key Results

Zero dead features. All 16,384 learned features are active, meaning none of the SAE's capacity is wasted. This indicates well-calibrated L1 regularization (λ=0.005) and sufficient training data.
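A dead feature is one that never activates over a large sample of inputs. A minimal sketch of the check, using a synthetic activation matrix in place of real SAE outputs (with the real SAE you would accumulate activations over many batches first):

```python
import torch

# Count dead features: a feature is "dead" if it never fires (z > 0)
# across the sampled batch. `z` here is a toy stand-in for SAE feature
# activations of shape (n_examples, d_sae).
torch.manual_seed(0)
z = torch.relu(torch.randn(1024, 256))  # 256 toy features
z[:, :3] = 0.0                          # force three features to be dead
ever_active = (z > 0).any(dim=0)        # did each feature fire at least once?
n_dead = int((~ever_active).sum())
print(n_dead)                           # 3 in this toy example
```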

Low reconstruction loss. The final loss of 0.0062 means the SAE faithfully reconstructs the original activations while maintaining sparsity. The model's information is preserved in the decomposition.
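The 0.0062 figure is the raw training loss. A common way to put reconstruction quality on a normalized scale is the fraction of variance unexplained (FVU); this metric is an illustration with toy tensors, not a number reported for this release:

```python
import torch

# Fraction of variance unexplained: reconstruction error relative to the
# variance of the original activations. Near 0 means a faithful
# reconstruction; near 1 means the SAE explains almost nothing.
torch.manual_seed(0)
x = torch.randn(512, 64)                 # original activations (toy)
x_hat = x + 0.05 * torch.randn(512, 64)  # a near-perfect reconstruction
fvu = ((x - x_hat) ** 2).sum() / ((x - x.mean(dim=0)) ** 2).sum()
print(float(fvu))                        # small: the reconstruction is faithful
```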

Comparative analysis. This SAE was developed as part of a larger study comparing base model representations against a fine-tuned variant. By training identical SAEs on both models, we can identify features that emerge or disappear during fine-tuning — mapping exactly what changes when you teach a model new behaviors.

One finding from this comparison: fine-tuning can create what we call "memorization without grounding." The fine-tuned model develops features that recombine real memorized details into plausible but entirely fictional scenarios. The individual facts are real. The arrangement is not. The SAE makes these features visible and measurable.

How to Use It

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=4096, d_sae=16384):
        super().__init__()
        # Learned pre-encoder bias: subtracted before encoding,
        # added back after decoding.
        self.bias = nn.Parameter(torch.zeros(d_in))
        self.encoder = nn.Linear(d_in, d_sae)
        self.decoder = nn.Linear(d_sae, d_in, bias=False)

    def forward(self, x):
        x_centered = x - self.bias
        z = torch.relu(self.encoder(x_centered))  # sparse feature activations
        x_hat = self.decoder(z) + self.bias       # reconstruction
        return x_hat, z

sae = SparseAutoencoder()
ckpt = torch.load("sae_base_best.pt", map_location="cpu")
sae.load_state_dict(ckpt["model_state_dict"])
sae.eval()

Hook it into the model, run inference, and inspect which of the 16,384 features activate for any given input. Cluster them, visualize them, or use them to steer model behavior.
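One way to capture the layer-16 MLP output is a PyTorch forward hook. The sketch below demonstrates the mechanics with a tiny stand-in module rather than the full model; with the real model you would hook the layer-16 MLP submodule (the exact attribute path, e.g. `model.model.layers[16].mlp`, depends on the model implementation and is an assumption here):

```python
import torch
import torch.nn as nn

# Capture a module's output during inference via a forward hook.
captured = {}

def hook(module, inputs, output):
    captured["act"] = output.detach()

layer = nn.Linear(16, 16)            # stand-in for the layer-16 MLP
handle = layer.register_forward_hook(hook)
_ = layer(torch.randn(2, 16))        # run an "inference" pass
handle.remove()

# With the SAE loaded as above, the captured activations would then be
# decomposed into features, e.g.:
#   _, z = sae(captured["act"])
#   active_features = (z > 0).nonzero()
print(captured["act"].shape)
```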

Training Details

Parameter         Value
Base model        Qwen 3.5 9B
Layer             16 (MLP output)
Data              pile-uncopyrighted (~50M tokens)
SAE dimensions    4,096 → 16,384
L1 coefficient    0.005
Learning rate     5e-5
Batch size        4,096
Hardware          RTX 4090 (24GB)
Total time        ~4 hours
Final loss        0.0062
Dead features     0 / 16,384

What's Next

This release is the base model SAE. The comparative analysis with fine-tuned variants is ongoing research, with a particular focus on which features emerge, shift, or disappear during fine-tuning.

Mechanistic interpretability is how we move from "the model does X" to "we understand why the model does X." That understanding is what makes AI systems trustworthy, debuggable, and safe.

Get the Model

The SAE weights are available now on HuggingFace under Apache 2.0.

Built by Kroonen AI Inc.