Genesis 1B, Run 2: 2× Throughput, Same Hardware
Author: Robin, Kroonen AI Inc.
Genesis 1B, Run 2 is a full architecture redesign. Same ~1B parameters, same 2x RTX 4090 setup, but 32 layers instead of 20, real-valued RoPE, torch.compile, batch size 4, and proper LR scheduling. Result: ~19k tok/s (up from 6,500), ~6 days to 20k steps instead of 13.
Architecture Comparison
| | Run 1 | Run 2 |
|---|---|---|
| Parameters | 1,003M | 1,000M |
| Layers | 20 | 32 |
| Hidden dim | 2048 | 1536 |
| Attention heads | 16 | 12 |
| KV heads (GQA) | 4 | 6 |
| FFN dim | 5632 | 4736 |
| Seq length | 2048 | 2048 |
| Batch size | 1 | 4 |
| torch.compile | ✗ | ✓ |
| Activation ckpt | ✗ | ✓ |
| Throughput | 6,500 tok/s | ~19,000 tok/s |
| Time/step | ~41s | ~21s |
| Est. time to 20k steps | ~13 days | ~6 days |
Why Deeper, Not Wider
Run 1 was wide: 20 layers at dim 2048. Run 2 trades width for depth: 32 layers at dim 1536. Same parameter budget, fundamentally different compute graph.
More layers means more sequential transformations, more chances for the model to build compositional representations. For reference, Llama 3.2 1B uses only 16 layers. Genesis 1B, Run 2 has 32. Twice the depth at the same parameter count is a bet on reasoning over memorization.
The narrower hidden dimension (1536 vs 2048) also plays better with torch.compile: smaller per-layer tensors mean less memory pressure and better kernel fusion.
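To check that the two configs really do land on the same budget, here is a back-of-the-envelope count for a Llama-style block with GQA and SwiGLU. The vocab size, tied embeddings, and omission of norms/biases are illustrative assumptions, not published Genesis specs:

```python
# Rough parameter count for a Llama-style transformer with GQA and SwiGLU.
# vocab and tied embeddings are assumptions for illustration only.
def approx_params(layers, d, heads, kv_heads, ffn, vocab=49_152, tied=True):
    head_dim = d // heads
    attn = d * d                              # q_proj
    attn += 2 * d * (kv_heads * head_dim)     # k_proj + v_proj (GQA)
    attn += d * d                             # o_proj
    mlp = 3 * d * ffn                         # gate/up/down projections (SwiGLU)
    embed = vocab * d * (1 if tied else 2)    # token embeddings (+ LM head if untied)
    return layers * (attn + mlp) + embed

run1 = approx_params(layers=20, d=2048, heads=16, kv_heads=4, ffn=5632)
run2 = approx_params(layers=32, d=1536, heads=12, kv_heads=6, ffn=4736)
print(f"Run 1 ≈ {run1 / 1e6:.0f}M  |  Run 2 ≈ {run2 / 1e6:.0f}M")
```

Under those assumptions both configurations come out within a percent or two of 1B, despite the very different depth/width split.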
The Free 2× Speedup
Two changes account for nearly all of the throughput gain:
- torch.compile: Fuses operations, eliminates Python overhead, and generates optimized CUDA kernels. This alone was a ~40% speedup with zero code changes to the model.
- Batch size 1 → 4: Activation checkpointing freed enough VRAM to quadruple the batch. Combined with torch.compile and real-valued RoPE (avoiding complex64 graph breaks), throughput jumped to ~19k tok/s.
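A minimal sketch of how the two changes combine, on a toy module rather than the actual Genesis training loop (FSDP wrapping and the data pipeline are omitted; requires a recent PyTorch with CUDA):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=1536):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        # Activation checkpointing: drop this block's intermediate activations
        # in the forward pass and recompute them during backward. The freed
        # VRAM is what allows the larger batch.
        return x + checkpoint(self.ff, x, use_reentrant=False)

model = nn.Sequential(*[Block() for _ in range(4)]).cuda()

# torch.compile: kernel fusion and removal of Python overhead, no model changes.
model = torch.compile(model)

x = torch.randn(4, 2048, 1536, device="cuda", requires_grad=True)  # batch 4, seq 2048
model(x).sum().backward()
```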
Same hardware. Same parameter count. ~3× throughput. No tricks, just using PyTorch properly.
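The real-valued RoPE matters because torch.compile graph-breaks on complex64 operations (commonly torch.view_as_complex in the standard formulation), but the same rotation can be written entirely in float math. A sketch with illustrative shapes, not the Genesis attention code:

```python
import torch

def rope_real(x, base=10_000.0):
    # x: (batch, heads, seq, head_dim), head_dim even
    b, h, s, d = x.shape
    inv_freq = base ** (-torch.arange(0, d, 2, device=x.device).float() / d)
    angles = torch.arange(s, device=x.device).float()[:, None] * inv_freq  # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]          # even/odd feature pairs
    rotated = torch.stack((x1 * cos - x2 * sin,  # rotate each pair by its angle
                           x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                   # back to (batch, heads, seq, head_dim)

q = torch.randn(2, 12, 2048, 128)
q_rot = rope_real(q)  # same shape, positions encoded as rotations; no complex dtypes
```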
LR Schedule Fix
Run 1 had a bug: pure cosine decay from step 0. No linear warmup. The learning rate started high and the first few hundred steps were essentially random noise.
Run 2 uses proper linear warmup over 1,000 steps followed by cosine decay to 10% of peak LR. Standard practice, but it was missing before.
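The schedule as described, wired through LambdaLR; the peak LR shown here is an assumed placeholder, while the warmup length, total steps, and 10% floor come from the run:

```python
import math
import torch

def lr_lambda(step, warmup=1_000, total=20_000, min_ratio=0.10):
    if step < warmup:
        return step / warmup                       # linear warmup from 0 to peak
    progress = (step - warmup) / max(1, total - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_ratio + (1.0 - min_ratio) * cosine  # cosine decay to 10% of peak

params = [torch.nn.Parameter(torch.zeros(1))]      # stand-in for model parameters
opt = torch.optim.AdamW(params, lr=3e-4)           # peak LR is an assumption
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

for _ in range(1_500):        # advance past warmup to see the decay phase begin
    opt.step()
    sched.step()
print(sched.get_last_lr())    # slightly below the peak, decaying toward 10% of it
```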
Checkpoint Infrastructure
Run 2 introduces DCP checkpoint versioning with full architecture metadata embedded in every checkpoint. Each save includes the complete model config (layers, dimensions, head counts, LR schedule parameters) so any checkpoint is self-describing.
Auto rotation keeps the last 5 checkpoints and prunes older ones. Try the latest checkpoint in the live playground on HuggingFace.
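A sketch of the self-describing checkpoint plus rotation pattern; the directory layout, config fields, and sidecar-JSON approach are illustrative assumptions rather than the exact Genesis implementation (with FSDP, state-dict extraction and rank-0 gating need more care than shown here):

```python
import json
import shutil
from pathlib import Path
import torch.distributed.checkpoint as dcp

CKPT_ROOT = Path("checkpoints")
KEEP_LAST = 5

def save_checkpoint(step, model, optimizer, config):
    ckpt_dir = CKPT_ROOT / f"step_{step:07d}"
    # Sharded DCP save: each rank writes its own shards instead of gathering
    # the full state dict on rank 0.
    dcp.save(
        {"model": model.state_dict(), "optim": optimizer.state_dict()},
        checkpoint_id=str(ckpt_dir),
    )
    # Make the checkpoint self-describing: write the full architecture/config
    # next to the shards (rank 0 only in a real multi-GPU run).
    (ckpt_dir / "config.json").write_text(json.dumps(config, indent=2))
    # Rotation: keep only the KEEP_LAST most recent checkpoints.
    for old in sorted(CKPT_ROOT.glob("step_*"))[:-KEEP_LAST]:
        shutil.rmtree(old)
```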
What's Next
Training is running now. At ~19k tok/s, we'll hit 20k steps in roughly 6 days. The loss curve will tell us whether the deeper architecture was the right call. Follow progress on the training progress blog or try the live playground.
More from the Genesis Series
Genesis 1B: Training Progress
Model specs, dataset, and training infrastructure across both runs. Includes a live HuggingFace playground.
The Optimizer State Bug: A Silent Failure in DCP Resume (Run 1)
A silent AdamW state bug during Run 1 that produced a false recovery on poisoned weights.
Fixing FSDP Checkpoint Deadlocks on 2x RTX 4090 (Run 1)
How DCP sharded checkpoints and CPU-offload resume fixed deadlocks on consumer GPUs without NVLink.
The Genesis Manifesto: Sovereign Intelligence for the Post-Generative Era
Data sovereignty, constitutional alignment, and why the future of AI is local, private, and personality-first.
Mapping the Mind of Qwen 3.5 9B
A sparse autoencoder for mechanistic interpretability: zero dead features, 16,384 dimensions.