Mapping the Mind of Qwen 3.5 9B: A Sparse Autoencoder for Mechanistic Interpretability
Today we're releasing a sparse autoencoder (SAE) trained on the internal activations of Qwen 3.5 9B. It's available now on HuggingFace under Apache 2.0.
What Is a Sparse Autoencoder?
Large language models process text through layers of dense neural network activations. These activations are rich with information but nearly impossible to interpret directly — a single vector of 4,096 numbers doesn't tell you much about what the model is "thinking."
A sparse autoencoder decomposes these dense activations into a much larger set of interpretable features. Instead of 4,096 dense dimensions, we get 16,384 sparse features — most of which are zero at any given time. The features that are active correspond to specific concepts, patterns, or behaviors the model has learned.
Think of it like decomposing a chord into its individual notes. Each note (feature) is simple and recognizable on its own, but together they reconstruct the full sound of what the model represents internally.
Why Qwen 3.5 9B?
Qwen 3.5 9B sits at a compelling point in the model size spectrum — large enough to exhibit complex emergent behaviors, small enough to study on consumer hardware. It's a strong open-weight model with competitive benchmark scores, making it an ideal subject for interpretability research.
What We Did
We collected approximately 50 million tokens of activations from the MLP output at layer 16 — the middle of the network, where representations tend to be the most abstract and information-rich.
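As a rough sketch of how such activations can be captured, a forward hook on the target module records its output during inference. The toy block stack below stands in for the real transformer; in our pipeline the hook would sit on the layer-16 MLP module of Qwen 3.5 9B, and the width would be 4,096 rather than 64.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64  # stand-in width; the real MLP output width is 4096

# Toy stack of blocks standing in for the transformer; in practice the hook
# is registered on the real model's layer-16 MLP module.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()) for _ in range(4)]
)

captured = []
def save_activation(module, inputs, output):
    # Detach and move to CPU so chunks can be accumulated and flushed to disk
    captured.append(output.detach().cpu())

handle = blocks[2].register_forward_hook(save_activation)  # hook the "middle" block

x = torch.randn(8, d_model)  # pretend batch of 8 token activations
for block in blocks:
    x = block(x)

handle.remove()
print(captured[0].shape)  # one chunk of (batch, d_model) activations
```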
The activations were streamed from monology/pile-uncopyrighted through the model and saved in chunks to disk. A sparse autoencoder with 4x expansion (4,096 → 16,384 features) was then trained on these activations using MSE reconstruction loss with an L1 sparsity penalty.
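The training objective described above can be sketched as MSE reconstruction plus an L1 penalty on the feature activations. The λ value matches the training table below, but this function is a simplified stand-in for the actual training code.

```python
import torch

lam = 0.005  # L1 coefficient (lambda) from the training run

def sae_loss(x, x_hat, z, lam=lam):
    # MSE term keeps the reconstruction x_hat faithful to the original
    # activation x; the L1 term on feature activations z pushes most
    # features toward zero, enforcing sparsity.
    mse = torch.mean((x_hat - x) ** 2)
    l1 = lam * z.abs().sum(dim=-1).mean()
    return mse + l1

# Toy example: perfect reconstruction with exactly one active feature
x = torch.ones(2, 4)
z = torch.zeros(2, 16)
z[:, 0] = 1.0
loss = sae_loss(x, x, z)
print(loss.item())  # mse = 0, so loss = 0.005 * 1.0 = 0.005
```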
The entire pipeline — activation collection and SAE training — ran on a single NVIDIA RTX 4090 in approximately 4 hours.
Key Results
Zero dead features. Every one of the 16,384 learned features fires on at least some inputs, so none of the SAE's capacity is wasted. This indicates well-calibrated L1 regularization (λ=0.005) and sufficient training data.
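Dead features can be counted by checking which features never activate across a large batch of inputs. In this minimal sketch, random data and a freshly initialized encoder stand in for the real activations and checkpoint, and the dimensions are scaled down from 4,096 → 16,384.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_sae = 64, 256  # scaled-down stand-ins for 4096 -> 16384

encoder = nn.Linear(d_in, d_sae)
x = torch.randn(10_000, d_in)   # stand-in batch of activations
z = torch.relu(encoder(x))      # sparse feature activations

fired = (z > 0).any(dim=0)      # did each feature fire at least once?
dead = int((~fired).sum())
print(f"dead features: {dead} / {d_sae}")
```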
Low reconstruction loss. The final loss of 0.0062 means the SAE faithfully reconstructs the original activations while maintaining sparsity. The model's information is preserved in the decomposition.
Comparative analysis. This SAE was developed as part of a larger study comparing base model representations against a fine-tuned variant. By training identical SAEs on both models, we can identify features that emerge or disappear during fine-tuning — mapping exactly what changes when you teach a model new behaviors.
One finding from this comparison: fine-tuning can create what we call "memorization without grounding." The fine-tuned model develops features that recombine real memorized details into plausible but entirely fictional scenarios. The individual facts are real. The arrangement is not. The SAE makes these features visible and measurable.
How to Use It
```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=4096, d_sae=16384):
        super().__init__()
        # Shared bias, subtracted before encoding and added back after decoding
        self.bias = nn.Parameter(torch.zeros(d_in))
        self.encoder = nn.Linear(d_in, d_sae)
        self.decoder = nn.Linear(d_sae, d_in, bias=False)

    def forward(self, x):
        x_centered = x - self.bias
        z = torch.relu(self.encoder(x_centered))  # sparse feature activations
        x_hat = self.decoder(z) + self.bias       # reconstruction
        return x_hat, z

sae = SparseAutoencoder()
ckpt = torch.load("sae_base_best.pt", map_location="cpu")
sae.load_state_dict(ckpt["model_state_dict"])
```

Hook it into the model, run inference, and inspect which of the 16,384 features activate for any given input. Cluster them, visualize them, or use them to steer model behavior.
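One hedged sketch of that workflow: a forward hook passes the hooked activation through the SAE, records which features fire, and can optionally substitute the reconstruction for the raw activation. A toy linear layer stands in for the real layer-16 MLP output, and a compact encoder/decoder pair stands in for the checkpointed SparseAutoencoder, with dimensions scaled down for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_sae = 64, 256  # scaled down from 4096 -> 16384

# Compact stand-in for the checkpointed SparseAutoencoder
encoder = nn.Linear(d_in, d_sae)
decoder = nn.Linear(d_sae, d_in, bias=False)

mlp = nn.Linear(d_in, d_in)  # stand-in for the model's layer-16 MLP output

active_features = []
def sae_hook(module, inputs, output):
    z = torch.relu(encoder(output))
    # Record the indices of features that fired on this input
    active_features.append((z > 0).nonzero(as_tuple=True)[-1])
    # Returning a tensor from a forward hook replaces the module's output,
    # so the reconstruction flows onward in place of the raw activation
    return decoder(z)

handle = mlp.register_forward_hook(sae_hook)
out = mlp(torch.randn(1, d_in))
handle.remove()

print(out.shape)                # reconstruction, same shape as the raw activation
print(len(active_features[0]))  # number of features that fired
```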
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen 3.5 9B |
| Layer | 16 (MLP output) |
| Data | pile-uncopyrighted (~50M tokens) |
| SAE dimensions | 4,096 → 16,384 |
| L1 coefficient | 0.005 |
| Learning rate | 5e-5 |
| Batch size | 4,096 |
| Hardware | RTX 4090 (24GB) |
| Total time | ~4 hours |
| Final loss | 0.0062 |
| Dead features | 0 / 16,384 |
What's Next
This release is the base model SAE. The comparative analysis with fine-tuned variants is ongoing research. We're particularly interested in:
- Feature-level diff between base and fine-tuned models — which features appear, disappear, or change magnitude after training?
- Steering via SAE features — can we amplify or suppress specific behaviors by manipulating individual features during inference?
- Scaling to more layers — layer 16 is one snapshot. A full-model SAE suite across all 32 layers would give complete visibility into the model's processing pipeline.
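The feature-level diff idea can be prototyped by comparing per-feature firing frequencies measured under the base and fine-tuned SAEs. Everything below is illustrative: the frequencies are synthetic stand-ins, the thresholds are arbitrary, and the dimension is scaled down from 16,384.

```python
import torch

d_sae = 256  # stand-in for 16384

# Synthetic firing frequencies: fraction of tokens on which each feature
# activates, measured separately under the base and fine-tuned SAEs.
freq_base = torch.full((d_sae,), 0.02)
freq_ft = freq_base.clone()
freq_ft[:5] = 0.0    # features that "disappear" after fine-tuning
freq_ft[5:10] = 0.25 # features that grow far more active

diff = freq_ft - freq_base
disappeared = (freq_base > 1e-3) & (freq_ft < 1e-4)
amplified = diff > 0.1

print(int(disappeared.sum()), "features went quiet;",
      int(amplified.sum()), "grew substantially more active")
# -> 5 features went quiet; 5 grew substantially more active
```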
Mechanistic interpretability is how we move from "the model does X" to "we understand why the model does X." That understanding is what makes AI systems trustworthy, debuggable, and safe.
Get the Model
- HuggingFace: kroonen-ai/sae-qwen3.5-9b
- License: Apache 2.0
Built by Kroonen AI Inc.