
Genesis 1B, Run 2: 2× Throughput, Same Hardware

Author: Robin, Kroonen AI Inc.

Tags: Genesis, architecture, performance, torch.compile

Genesis 1B, Run 2 is a full architecture redesign. Same ~1B parameters, same 2× RTX 4090 setup, but 32 layers instead of 20, real-valued RoPE, torch.compile, batch size 4, and proper LR scheduling. Result: ~19k tok/s (up from 6,500), ~6 days to 20k steps instead of 13.

Architecture Comparison

|                 | Run 1         | Run 2          |
|-----------------|---------------|----------------|
| Parameters      | 1,003M        | 1,000M         |
| Layers          | 20            | 32             |
| Hidden dim      | 2048          | 1536           |
| Attention heads | 16            | 12             |
| KV heads (GQA)  | 4             | 6              |
| FFN dim         | 5632          | 4736           |
| Seq length      | 2048          | 2048           |
| Batch size      | 1             | 2              |
| torch.compile   | ✗             | ✓              |
| Activation ckpt | ✗             | ✓              |
| Throughput      | 6,500 tok/s   | ~19,000 tok/s  |
| Time/step       | ~41 s         | ~21 s          |
| Est. 20k steps  | ~13 days      | ~6 days        |

Why Deeper, Not Wider

Run 1 was wide: 20 layers at dim 2048. Run 2 trades width for depth: 32 layers at dim 1536. Same parameter budget, fundamentally different compute graph.

More layers means more sequential transformations, more chances for the model to build compositional representations. For reference, Llama 3.2 1B uses only 16 layers. Genesis 1B, Run 2 has 32. Twice the depth at the same parameter count is a bet on reasoning over memorization.

The narrower hidden dimension (1536 vs 2048) also plays better with torch.compile: smaller per-layer tensors mean less memory pressure and better kernel fusion.
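A quick way to sanity-check the "same parameter budget" claim is a back-of-envelope count. The sketch below assumes a SwiGLU FFN (gate/up/down matrices), grouped-query attention, tied input/output embeddings, and a ~49k vocabulary; the FFN style and vocab size are assumptions for illustration, not confirmed Genesis details.

```python
# Back-of-envelope parameter count for both configs.
# Assumptions (not confirmed for Genesis): SwiGLU FFN, tied embeddings,
# ~49k vocab; norm and bias parameters ignored as negligible.
def transformer_params(layers, d, heads, kv_heads, ffn, vocab=49_152):
    head_dim = d // heads
    attn = d * d                                 # Q projection
    attn += 2 * d * (kv_heads * head_dim)        # K and V (grouped-query)
    attn += d * d                                # output projection
    swiglu = 3 * d * ffn                         # gate, up, down matrices
    return layers * (attn + swiglu) + vocab * d  # blocks + tied embedding

run1 = transformer_params(layers=20, d=2048, heads=16, kv_heads=4, ffn=5632)
run2 = transformer_params(layers=32, d=1536, heads=12, kv_heads=6, ffn=4736)
print(f"Run 1: {run1 / 1e6:.0f}M  Run 2: {run2 / 1e6:.0f}M")
# -> roughly 1,002M and 1,000M under these assumptions
```

Under those assumptions both configs land within a few million parameters of the counts in the table, which is consistent with the equal-budget framing.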

The Free 2× Speedup

Two changes account for nearly all of the throughput gain: enabling torch.compile, and doubling the batch size, with activation checkpointing freeing the memory to do so.

Same hardware. Same parameter count. ~3× throughput. No tricks, just using PyTorch properly.
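A minimal sketch of how the two pieces slot into a standard PyTorch model; `Block` and its feed-forward stack are placeholders, not the Genesis layers.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Stand-in transformer block; the real Genesis layers differ."""
    def __init__(self, d):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d)
        )

    def forward(self, x):
        # Activation checkpointing: drop this block's activations in the
        # forward pass and recompute them during backward, trading compute
        # for the memory headroom that allows a larger batch.
        return x + checkpoint(self.ff, x, use_reentrant=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(*[Block(1536) for _ in range(32)]).to(device)

# torch.compile traces the model and emits fused kernels, removing
# per-op Python overhead on every step after the first.
model = torch.compile(model)
```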

LR Schedule Fix

Run 1 had a bug: pure cosine decay from step 0, with no linear warmup. The learning rate started at its peak, so the first few hundred steps amounted to little more than noisy updates.

Run 2 uses proper linear warmup over 1,000 steps followed by cosine decay to 10% of peak LR. Standard practice, but it was missing before.
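In PyTorch this is a few lines with `LambdaLR`. The 1,000-step warmup and 10% floor below match the numbers above; `total=20_000` and the optimizer settings are illustrative.

```python
import math
import torch

def lr_lambda(step, warmup=1_000, total=20_000, floor=0.1):
    """Linear warmup, then cosine decay to 10% of the peak LR."""
    if step < warmup:
        return step / warmup                     # 0 -> 1 over the warmup
    progress = (step - warmup) / (total - warmup)
    return floor + (1 - floor) * 0.5 * (1 + math.cos(math.pi * progress))

model = torch.nn.Linear(8, 8)                    # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # peak LR illustrative
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# call scheduler.step() once per training step
```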

Checkpoint Infrastructure

Run 2 introduces DCP (PyTorch Distributed Checkpoint) versioning with full architecture metadata embedded in every checkpoint. Each save includes the complete model config (layers, dimensions, head counts, LR schedule parameters), so any checkpoint is self-describing.

Auto-rotation keeps the last 5 checkpoints and prunes older ones. Try the latest checkpoint in the live playground on HuggingFace.
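A sketch of that save path, assuming PyTorch's `torch.distributed.checkpoint` API; the `checkpoints/step_NNNNNN` directory layout, the `config` contents, and the rotation helper are illustrative, not the Genesis code.

```python
import os
import shutil
import torch.distributed.checkpoint as dcp

CKPT_ROOT = "checkpoints"  # illustrative layout: checkpoints/step_000500/...

def save_checkpoint(model, optimizer, step, config, keep=5):
    path = os.path.join(CKPT_ROOT, f"step_{step:06d}")
    dcp.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            # Full architecture metadata rides along with the weights,
            # so the checkpoint is self-describing.
            "config": config,
        },
        checkpoint_id=path,
    )
    # Auto-rotation: prune everything older than the newest `keep` saves.
    ckpts = sorted(d for d in os.listdir(CKPT_ROOT) if d.startswith("step_"))
    for old in ckpts[:-keep]:
        shutil.rmtree(os.path.join(CKPT_ROOT, old))
```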

What's Next

Training is running now. At ~19k tok/s, we'll hit 20k steps in roughly 6 days. The loss curve will tell us whether the deeper architecture was the right call. Follow progress on the training progress blog or try the live playground.