Mapping the Mind of Qwen 3.5 9B: A Sparse Autoencoder for Mechanistic Interpretability
Today we're releasing a sparse autoencoder (SAE) trained on the internal activations of Qwen 3.5 9B. It's available now on HuggingFace under Apache 2.0.
What Is a Sparse Autoencoder?
Large language models process text through layers of dense neural network activations. These activations are rich with information but nearly impossible to interpret directly — a single vector of 4,096 numbers doesn't tell you much about what the model is "thinking."
A sparse autoencoder decomposes these dense activations into a much larger set of interpretable features. Instead of 4,096 dense dimensions, we get 16,384 sparse features — most of which are zero at any given time. The features that are active correspond to specific concepts, patterns, or behaviors the model has learned.
Think of it like decomposing a chord into its individual notes. Each note (feature) is simple and recognizable on its own, but together they reconstruct the full sound of what the model represents internally.
Why Qwen 3.5 9B?
Qwen 3.5 9B sits at a compelling point in the model size spectrum — large enough to exhibit complex emergent behaviors, small enough to study on consumer hardware. It's a strong open-weight model with competitive benchmark scores, making it an ideal subject for interpretability research.
What We Did
We collected approximately 50 million tokens of activations from the MLP output at layer 16 — the middle of the network, where representations tend to be the most abstract and information-rich.
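As a rough sketch of how such activations can be captured, a forward hook on the target module records its output during inference. The toy block stack below stands in for the real transformer; in our pipeline the hook would sit on the layer-16 MLP module of Qwen 3.5 9B, and the width would be 4,096 rather than 64.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64  # stand-in width; the real MLP output width is 4096

# Toy stack of blocks standing in for the transformer; in practice the hook
# is registered on the real model's layer-16 MLP module.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()) for _ in range(4)]
)

captured = []
def save_activation(module, inputs, output):
    # Detach and move to CPU so chunks can be accumulated and flushed to disk
    captured.append(output.detach().cpu())

handle = blocks[2].register_forward_hook(save_activation)  # hook the "middle" block

x = torch.randn(8, d_model)  # pretend batch of 8 token activations
for block in blocks:
    x = block(x)

handle.remove()
print(captured[0].shape)  # one chunk of (batch, d_model) activations
```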
The activations were streamed from monology/pile-uncopyrighted through the model and saved in chunks to disk. A sparse autoencoder with 4x expansion (4,096 → 16,384 features) was then trained on these activations using MSE reconstruction loss with an L1 sparsity penalty.
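The training objective described above can be sketched as MSE reconstruction plus an L1 penalty on the feature activations. The λ value matches the training table below, but this function is a simplified stand-in for the actual training code.

```python
import torch

lam = 0.005  # L1 coefficient (lambda) from the training run

def sae_loss(x, x_hat, z, lam=lam):
    # MSE term keeps the reconstruction x_hat faithful to the original
    # activation x; the L1 term on feature activations z pushes most
    # features toward zero, enforcing sparsity.
    mse = torch.mean((x_hat - x) ** 2)
    l1 = lam * z.abs().sum(dim=-1).mean()
    return mse + l1

# Toy example: perfect reconstruction with exactly one active feature
x = torch.ones(2, 4)
z = torch.zeros(2, 16)
z[:, 0] = 1.0
loss = sae_loss(x, x, z)
print(loss.item())  # mse = 0, so loss = 0.005 * 1.0 = 0.005
```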
The entire pipeline — activation collection and SAE training — ran on a single NVIDIA RTX 4090 in approximately 4 hours.
Key Results
Zero dead features. Every one of the 16,384 learned features fires on at least some inputs, so none of the SAE's capacity is wasted. This indicates well-calibrated L1 regularization (λ=0.005) and sufficient training data.
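Dead features can be counted by checking which features never activate across a large batch of inputs. In this minimal sketch, random data and a freshly initialized encoder stand in for the real activations and checkpoint, and the dimensions are scaled down from 4,096 → 16,384.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_sae = 64, 256  # scaled-down stand-ins for 4096 -> 16384

encoder = nn.Linear(d_in, d_sae)
x = torch.randn(10_000, d_in)   # stand-in batch of activations
z = torch.relu(encoder(x))      # sparse feature activations

fired = (z > 0).any(dim=0)      # did each feature fire at least once?
dead = int((~fired).sum())
print(f"dead features: {dead} / {d_sae}")
```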
Low reconstruction loss. The final loss of 0.0062 means the SAE faithfully reconstructs the original activations while maintaining sparsity. The model's information is preserved in the decomposition.
Comparative analysis. This SAE was developed as part of a larger study comparing base model representations against a fine-tuned variant. By training identical SAEs on both models, we can identify features that emerge or disappear during fine-tuning — mapping exactly what changes when you teach a model new behaviors.
One finding from this comparison: fine-tuning can create what we call "memorization without grounding." The fine-tuned model develops features that recombine real memorized details into plausible but entirely fictional scenarios. The individual facts are real. The arrangement is not. The SAE makes these features visible and measurable.
How to Use It
```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=4096, d_sae=16384):
        super().__init__()
        # Shared bias, subtracted before encoding and added back after decoding
        self.bias = nn.Parameter(torch.zeros(d_in))
        self.encoder = nn.Linear(d_in, d_sae)
        self.decoder = nn.Linear(d_sae, d_in, bias=False)

    def forward(self, x):
        x_centered = x - self.bias
        z = torch.relu(self.encoder(x_centered))  # sparse feature activations
        x_hat = self.decoder(z) + self.bias       # reconstruction
        return x_hat, z

sae = SparseAutoencoder()
ckpt = torch.load("sae_base_best.pt", map_location="cpu")
sae.load_state_dict(ckpt["model_state_dict"])
```

Hook it into the model, run inference, and inspect which of the 16,384 features activate for any given input. Cluster them, visualize them, or use them to steer model behavior.
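One hedged sketch of that workflow: a forward hook passes the hooked activation through the SAE, records which features fire, and can optionally substitute the reconstruction for the raw activation. A toy linear layer stands in for the real layer-16 MLP output, and a compact encoder/decoder pair stands in for the checkpointed SparseAutoencoder, with dimensions scaled down for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_sae = 64, 256  # scaled down from 4096 -> 16384

# Compact stand-in for the checkpointed SparseAutoencoder
encoder = nn.Linear(d_in, d_sae)
decoder = nn.Linear(d_sae, d_in, bias=False)

mlp = nn.Linear(d_in, d_in)  # stand-in for the model's layer-16 MLP output

active_features = []
def sae_hook(module, inputs, output):
    z = torch.relu(encoder(output))
    # Record the indices of features that fired on this input
    active_features.append((z > 0).nonzero(as_tuple=True)[-1])
    # Returning a tensor from a forward hook replaces the module's output,
    # so the reconstruction flows onward in place of the raw activation
    return decoder(z)

handle = mlp.register_forward_hook(sae_hook)
out = mlp(torch.randn(1, d_in))
handle.remove()

print(out.shape)                # reconstruction, same shape as the raw activation
print(len(active_features[0]))  # number of features that fired
```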
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen 3.5 9B |
| Layer | 16 (MLP output) |
| Data | pile-uncopyrighted (~50M tokens) |
| SAE dimensions | 4,096 → 16,384 |
| L1 coefficient | 0.005 |
| Learning rate | 5e-5 |
| Batch size | 4,096 |
| Hardware | RTX 4090 (24GB) |
| Total time | ~4 hours |
| Final loss | 0.0062 |
| Dead features | 0 / 16,384 |
What's Next
This release is the base model SAE. The comparative analysis with fine-tuned variants is ongoing research. We're particularly interested in:
- Feature-level diff between base and fine-tuned models — which features appear, disappear, or change magnitude after training?
- Steering via SAE features — can we amplify or suppress specific behaviors by manipulating individual features during inference?
- Scaling to more layers — layer 16 is one snapshot. A full-model SAE suite across all 32 layers would give complete visibility into the model's processing pipeline.
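The feature-level diff idea can be prototyped by comparing per-feature firing frequencies measured under the base and fine-tuned SAEs. Everything below is illustrative: the frequencies are synthetic stand-ins, the thresholds are arbitrary, and the dimension is scaled down from 16,384.

```python
import torch

d_sae = 256  # stand-in for 16384

# Synthetic firing frequencies: fraction of tokens on which each feature
# activates, measured separately under the base and fine-tuned SAEs.
freq_base = torch.full((d_sae,), 0.02)
freq_ft = freq_base.clone()
freq_ft[:5] = 0.0    # features that "disappear" after fine-tuning
freq_ft[5:10] = 0.25 # features that grow far more active

diff = freq_ft - freq_base
disappeared = (freq_base > 1e-3) & (freq_ft < 1e-4)
amplified = diff > 0.1

print(int(disappeared.sum()), "features went quiet;",
      int(amplified.sum()), "grew substantially more active")
# -> 5 features went quiet; 5 grew substantially more active
```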
Mechanistic interpretability is how we move from "the model does X" to "we understand why the model does X." That understanding is what makes AI systems trustworthy, debuggable, and safe.
Get the Model
- HuggingFace: kroonen-ai/sae-qwen3.5-9b
- License: Apache 2.0
Built by Kroonen AI Inc.