Genesis 1B, Run 2: 2× Throughput, Same Hardware
Author: Robin, Kroonen AI Inc.
Genesis 1B, Run 2 is a full architecture redesign. Same ~1B parameters, same 2x RTX 4090 setup, but 32 layers instead of 20, real-valued RoPE, torch.compile, batch size 4, and proper LR scheduling. Result: ~19k tok/s (up from 6,500), ~6 days to 20k steps instead of 13.
Architecture Comparison
| | Run 1 | Run 2 |
|---|---|---|
| Parameters | 1,003M | 1,000M |
| Layers | 20 | 32 |
| Hidden dim | 2048 | 1536 |
| Attention heads | 16 | 12 |
| KV heads (GQA) | 4 | 6 |
| FFN dim | 5632 | 4736 |
| Seq length | 2048 | 2048 |
| Batch size | 1 | 4 |
| torch.compile | ✗ | ✓ |
| Activation ckpt | ✗ | ✓ |
| Throughput | 6,500 tok/s | ~19,000 tok/s |
| Time/step | ~41s | ~21s |
| Est. time to 20k steps | ~13 days | ~6 days |
Why Deeper, Not Wider
Run 1 was wide: 20 layers at dim 2048. Run 2 trades width for depth: 32 layers at dim 1536. Same parameter budget, fundamentally different compute graph.
More layers means more sequential transformations, more chances for the model to build compositional representations. For reference, Llama 3.2 1B uses only 16 layers. Genesis 1B, Run 2 has 32. Twice the depth at the same parameter count is a bet on reasoning over memorization.
The narrower hidden dimension (1536 vs 2048) also plays better with torch.compile: smaller per-layer tensors mean less memory pressure and better kernel fusion.
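To check that the two configs really do land on the same budget, here is a back-of-the-envelope count for a Llama-style block with GQA and SwiGLU. The vocab size, tied embeddings, and omission of norms/biases are illustrative assumptions, not published Genesis specs:

```python
# Rough parameter count for a Llama-style transformer with GQA and SwiGLU.
# vocab and tied embeddings are assumptions for illustration only.
def approx_params(layers, d, heads, kv_heads, ffn, vocab=49_152, tied=True):
    head_dim = d // heads
    attn = d * d                              # q_proj
    attn += 2 * d * (kv_heads * head_dim)     # k_proj + v_proj (GQA)
    attn += d * d                             # o_proj
    mlp = 3 * d * ffn                         # gate/up/down projections (SwiGLU)
    embed = vocab * d * (1 if tied else 2)    # token embeddings (+ LM head if untied)
    return layers * (attn + mlp) + embed

run1 = approx_params(layers=20, d=2048, heads=16, kv_heads=4, ffn=5632)
run2 = approx_params(layers=32, d=1536, heads=12, kv_heads=6, ffn=4736)
print(f"Run 1 ≈ {run1 / 1e6:.0f}M  |  Run 2 ≈ {run2 / 1e6:.0f}M")
```

Under those assumptions both configurations come out within a percent or two of 1B, despite the very different depth/width split.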
The Free 2× Speedup
Two changes account for nearly all of the throughput gain:
- torch.compile: Fuses operations, eliminates Python overhead, and generates optimized CUDA kernels. This alone was a ~40% speedup with zero code changes to the model.
- Batch size 1 → 4: Activation checkpointing freed enough VRAM to quadruple the batch. Combined with torch.compile and real-valued RoPE (avoiding complex64 graph breaks), throughput jumped to ~19k tok/s.
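A minimal sketch of how the two changes combine, on a toy module rather than the actual Genesis training loop (FSDP wrapping and the data pipeline are omitted; requires a recent PyTorch with CUDA):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=1536):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        # Activation checkpointing: drop this block's intermediate activations
        # in the forward pass and recompute them during backward. The freed
        # VRAM is what allows the larger batch.
        return x + checkpoint(self.ff, x, use_reentrant=False)

model = nn.Sequential(*[Block() for _ in range(4)]).cuda()

# torch.compile: kernel fusion and removal of Python overhead, no model changes.
model = torch.compile(model)

x = torch.randn(4, 2048, 1536, device="cuda", requires_grad=True)  # batch 4, seq 2048
model(x).sum().backward()
```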
Same hardware. Same parameter count. ~3× throughput. No tricks, just using PyTorch properly.
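The real-valued RoPE matters because torch.compile graph-breaks on complex64 operations (commonly torch.view_as_complex in the standard formulation), but the same rotation can be written entirely in float math. A sketch with illustrative shapes, not the Genesis attention code:

```python
import torch

def rope_real(x, base=10_000.0):
    # x: (batch, heads, seq, head_dim), head_dim even
    b, h, s, d = x.shape
    inv_freq = base ** (-torch.arange(0, d, 2, device=x.device).float() / d)
    angles = torch.arange(s, device=x.device).float()[:, None] * inv_freq  # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]          # even/odd feature pairs
    rotated = torch.stack((x1 * cos - x2 * sin,  # rotate each pair by its angle
                           x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                   # back to (batch, heads, seq, head_dim)

q = torch.randn(2, 12, 2048, 128)
q_rot = rope_real(q)  # same shape, positions encoded as rotations; no complex dtypes
```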
LR Schedule Fix
Run 1 had a bug: pure cosine decay from step 0. No linear warmup. The learning rate started high and the first few hundred steps were essentially random noise.
Run 2 uses proper linear warmup over 1,000 steps followed by cosine decay to 10% of peak LR. Standard practice, but it was missing before.
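The schedule as described, wired through LambdaLR; the peak LR shown here is an assumed placeholder, while the warmup length, total steps, and 10% floor come from the run:

```python
import math
import torch

def lr_lambda(step, warmup=1_000, total=20_000, min_ratio=0.10):
    if step < warmup:
        return step / warmup                       # linear warmup from 0 to peak
    progress = (step - warmup) / max(1, total - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_ratio + (1.0 - min_ratio) * cosine  # cosine decay to 10% of peak

params = [torch.nn.Parameter(torch.zeros(1))]      # stand-in for model parameters
opt = torch.optim.AdamW(params, lr=3e-4)           # peak LR is an assumption
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

for _ in range(1_500):        # advance past warmup to see the decay phase begin
    opt.step()
    sched.step()
print(sched.get_last_lr())    # slightly below the peak, decaying toward 10% of it
```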
Checkpoint Infrastructure
Run 2 introduces DCP checkpoint versioning with full architecture metadata embedded in every checkpoint. Each save includes the complete model config (layers, dimensions, head counts, LR schedule parameters) so any checkpoint is self-describing.
Auto rotation keeps the last 5 checkpoints and prunes older ones. Try the latest checkpoint in the live playground on HuggingFace.
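A sketch of the self-describing checkpoint plus rotation pattern; the directory layout, config fields, and sidecar-JSON approach are illustrative assumptions rather than the exact Genesis implementation (with FSDP, state-dict extraction and rank-0 gating need more care than shown here):

```python
import json
import shutil
from pathlib import Path
import torch.distributed.checkpoint as dcp

CKPT_ROOT = Path("checkpoints")
KEEP_LAST = 5

def save_checkpoint(step, model, optimizer, config):
    ckpt_dir = CKPT_ROOT / f"step_{step:07d}"
    # Sharded DCP save: each rank writes its own shards instead of gathering
    # the full state dict on rank 0.
    dcp.save(
        {"model": model.state_dict(), "optim": optimizer.state_dict()},
        checkpoint_id=str(ckpt_dir),
    )
    # Make the checkpoint self-describing: write the full architecture/config
    # next to the shards (rank 0 only in a real multi-GPU run).
    (ckpt_dir / "config.json").write_text(json.dumps(config, indent=2))
    # Rotation: keep only the KEEP_LAST most recent checkpoints.
    for old in sorted(CKPT_ROOT.glob("step_*"))[:-KEEP_LAST]:
        shutil.rmtree(old)
```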
What's Next
Training is running now. At ~19k tok/s, we'll hit 20k steps in roughly 6 days. The loss curve will tell us whether the deeper architecture was the right call. Follow progress on the training progress blog or try the live playground.
More from the Genesis Series
Genesis 1B: Training Progress
Model specs, dataset, and training infrastructure across both runs. Includes a live HuggingFace playground.
The Optimizer State Bug: A Silent Failure in DCP Resume (Run 1)
A silent AdamW state bug during Run 1 that produced a false recovery on poisoned weights.
Fixing FSDP Checkpoint Deadlocks on 2x RTX 4090 (Run 1)
How DCP sharded checkpoints and CPU-offload resume fixed deadlocks on consumer GPUs without NVLink.
The Genesis Manifesto: Sovereign Intelligence for the Post-Generative Era
Data sovereignty, constitutional alignment, and why the future of AI is local, private, and personality-first.
Mapping the Mind of Qwen 3.5 9B
A sparse autoencoder for mechanistic interpretability: zero dead features, 16,384 dimensions.