Genesis 1B: Run 2 Complete — 80,000 Steps

Author: Robin, Kroonen AI Inc.

Genesis · 1B · Run 2 · pretraining · rtx-4090

✅ Run 2 complete — step 80,004 / 80,000, final loss 1.873

Run 2 reached its 80,000-step target (~42B tokens) on 2× RTX 4090. Explore the checkpoints in the live playground below.

Model: Genesis 1B

Parameters: 1,000M (1.0B)
Architecture: Llama-style decoder-only transformer
Hidden dim: 1536
Layers: 32
Attention heads: 12 (6 KV heads, GQA)
FFN dim: 4736 (SwiGLU)
Context length: 2048
Vocab size: 49,152
Precision: bfloat16
Positional encoding: RoPE (θ=500,000)
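As a rough cross-check of the figures above, the parameter count can be rebuilt from the listed dimensions. The sketch below assumes details the table does not state: no bias terms, two norm weight vectors per layer plus one final norm, and input/output embeddings that share a single matrix (tied). Under those assumptions the total lands close to the quoted 1,000M; with untied embeddings it would be roughly 75M higher.

```python
# Parameter-count estimate for a Llama-style decoder from the listed dimensions.
# Assumptions (NOT stated in the source): no biases, norm layers hold only a
# weight vector, and the input/output embeddings are tied.

def count_params(hidden=1536, layers=32, heads=12, kv_heads=6,
                 ffn=4736, vocab=49_152, tied_embeddings=True):
    head_dim = hidden // heads              # 1536 / 12 = 128
    kv_dim = kv_heads * head_dim            # 6 * 128 = 768 (GQA shrinks K and V)
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # Q and O, plus K and V
    mlp = 3 * hidden * ffn                  # SwiGLU: gate, up, and down projections
    norms = 2 * hidden                      # pre-attention and pre-MLP norm weights
    per_layer = attn + mlp + norms
    embed = vocab * hidden
    total = layers * per_layer + embed + hidden   # trailing term is the final norm
    if not tied_embeddings:
        total += embed                      # separate output projection
    return total


if __name__ == "__main__":
    print(f"tied embeddings:   {count_params() / 1e6:,.1f}M")
    print(f"untied embeddings: {count_params(tied_embeddings=False) / 1e6:,.1f}M")
```

Most of the budget sits in the 32 transformer blocks (about 925M under these assumptions); the embedding matrix accounts for the remaining ~75M.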

Training Configuration

GPUs: 2× RTX 4090 (PCIe, no NVLink)
Batch size: 4 per GPU
Gradient accumulation: 32 steps
Effective batch: 524,288 tokens/step
Learning rate: 1e-4 → 1e-5 (cosine decay)
Warmup: 1,000 steps
Optimizer: AdamW (β1=0.9, β2=0.95, wd=0.1)
Activation checkpointing: Enabled (per TransformerBlock)
DCP resume: ShardedStateDictConfig(offload_to_cpu=True)
CUDA allocator: expandable_segments:True
VRAM per GPU: ~20 GB with activation checkpointing
Throughput: ~19,000 tok/s
Target: ~42B tokens (80,000 steps, extended from 40,000)
Script: pretrainv3.py
NCCL: NCCL_P2P_DISABLE=1
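The effective batch, the token target, and an approximate compute time all follow from the configuration above. A small sketch of that arithmetic, assuming every sequence is packed to the full 2,048-token context and that throughput stays at the quoted ~19,000 tok/s (the variable names are illustrative and not taken from pretrainv3.py):

```python
# Token-budget arithmetic from the training configuration.
GPUS = 2
MICRO_BATCH = 4          # sequences per GPU per forward/backward pass
GRAD_ACCUM = 32          # micro-batches accumulated per optimizer step
CONTEXT = 2048           # tokens per sequence (assumes fully packed sequences)
TOKENS_PER_SEC = 19_000  # quoted throughput, treated as constant here
TARGET_STEPS = 80_000

tokens_per_step = GPUS * MICRO_BATCH * GRAD_ACCUM * CONTEXT   # 524,288
total_tokens = tokens_per_step * TARGET_STEPS                 # about 41.9B
seconds_per_step = tokens_per_step / TOKENS_PER_SEC
compute_days = total_tokens / TOKENS_PER_SEC / 86_400

if __name__ == "__main__":
    print(f"tokens per optimizer step: {tokens_per_step:,}")
    print(f"tokens at step {TARGET_STEPS:,}: {total_tokens / 1e9:.1f}B")
    print(f"seconds per step: {seconds_per_step:.1f}")
    print(f"pure compute time: {compute_days:.1f} days")
```

The compute-time figure counts training throughput only; checkpointing, restarts, and pauses between resumed segments are not included.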

Run 2: Training Progress (20k → 80k Extension)

Run 2 launched on March 24, 2026 with a redesigned 32-layer architecture and reached 20,000 steps on March 31, 2026. The run was extended first to 40,000 steps (~21B tokens), completing on April 7, 2026 with loss ~1.93, then to 60,000 steps (~31.5B tokens), and finally to its 80,000-step target (~42B tokens). It finished at step 80,004 with a final training loss of 1.873, and throughput held at ~19,000 tok/s throughout.

Step | Loss | Grad Norm | tok/s
0 | 11.1377 | 20.00 | 17,425
1,000 | 3.4161 | 0.74 | 18,936
2,000 | 3.0866 | 0.30 | 18,954
3,000 | 2.5517 | 0.22 | 18,948
4,000 | 2.6568 | 0.22 | 18,958
5,000 | 2.2971 | 0.17 | 18,946
6,000 | 2.2877 | 0.18 | 18,935
7,000 | 2.2235 | 0.17 | 18,936
8,000 | 2.1325 | 0.16 | 18,947
9,000 | 2.2878 | 0.16 | 18,830
10,000 | 2.1776 | 0.16 | 18,955
11,000 | 2.1164 | 0.16 | 18,960
12,000 | 2.2426 | 0.16 | 18,967
13,000 | 2.1838 | 0.16 | 18,971
14,000 | 2.0864 | 0.17 | 18,978
15,000 | 1.9520 | 0.17 | 18,975
16,000 | 1.8105 | 0.15 | 18,965
17,000 | 2.1301 | 0.16 | 18,956
18,000 | 2.1521 | 0.18 | 18,869
19,000 | 1.8729 | 0.16 | 18,973
20,000 | 2.2103 | 0.17 | 17,228
25,000 | 1.9375 | 0.17 | 18,910
30,000 | 1.9676 | 0.18 | 18,910
35,000 | 1.9055 | 0.18 | 18,914
40,000 | ~1.93 | 0.19 | 18,925
45,000 | 1.8044 | 0.20 | 19,064
50,000 | 1.8830 | 0.20 | 19,053
55,000 | 2.0567 | 0.21 | 19,059
60,000 | 1.9193 | 0.21 | 19,043
80,004 (final) | 1.873 | n/a | ~19,000

Training loss curve

[Figure: training loss (y-axis, 1.19 to 11.70) against training step (x-axis, 0 to 60,000).]

Training curve: 60,000 steps shown. Blue: steps 0–20,000 (WandB). Light blue: steps 20,000–40,000 (training log). Lighter blue: steps 40,000–60,000 (training log). Loss 1.9193 at step 60,000; Run 2 completed at step 80,004 with a final loss of 1.87.

At 20k steps, the log shows a loss of 2.2103, a noisy single-step value. The run continued its cosine decay toward 1e-5 through 40k steps, where the average loss over steps 38k–40k is ~1.93. It was then extended to 60,000 steps across several resumed segments (April 9–17, 2026); loss at step 60,000 was 1.9193, and throughput held at ~19,050 tok/s. From there the run continued to its 80,000-step target, finishing at step 80,004 with a final training loss of 1.87.
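For reference, a warmup-plus-cosine schedule using the values from the training configuration (peak 1e-4, floor 1e-5, 1,000 warmup steps) can be sketched as follows. The linear warmup shape and the decay horizon `total_steps` are assumptions: the post does not say how the schedule was re-anchored when the run was extended past 40,000 steps, so the default below covers only the first 40k.

```python
import math


def lr_at(step, peak=1e-4, floor=1e-5, warmup=1_000, total_steps=40_000):
    """Learning rate at `step`: linear warmup, then cosine decay to `floor`.

    `total_steps` is a hypothetical decay horizon, not a value from the source.
    """
    if step < warmup:
        return peak * step / warmup
    # Fraction of the decay phase completed, clamped so the rate stays at
    # `floor` for any step beyond the horizon.
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 500, 1_000, 20_500, 40_000, 60_000):
        print(f"step {s:>6,}: lr = {lr_at(s):.2e}")
```

With this shape the rate peaks exactly at the end of warmup and sits at the floor from `total_steps` onward, which is one plausible reading of "cosine decay toward 1e-5 through 40k steps".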

Checkpoints are backed up locally every 10 minutes and uploaded to HuggingFace. Try them in the live playground.

The Dataset

~60B tokens, curated from public sources and tokenized with a custom SentencePiece BPE tokenizer trained on the corpus itself.

The Road to Genesis 1B v0.1

Pre-training is only the first phase. The full pipeline has four stages:

Phase 1: Pre-training — Complete (80,000 steps)

Completed at step 80,004 (~42B tokens). Final training loss 1.87.

Phase 2: SFT (Supervised Fine-Tuning)

SFT runs on top of the pre-trained base. The dataset is 510,577 examples across constitutional data (generated with Claude Haiku), SmolTalk, OpenHermes 2.5, Tulu 3, and MetaMathQA. Training runs for 15,955 steps (1 epoch) at 2e-5 peak learning rate. The approach is inspired by Constitutional AI: define a set of principles and train the model to follow them. The goal is a model with genuine personality, not a model optimized for refusal rates.
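The two SFT figures can be checked against each other. The sketch below infers the number of examples per optimizer step from the example count and the step count; the resulting batch size of 32 and the "drop the last partial batch" behaviour are inferences from those two numbers, not settings stated in the post.

```python
# Consistency check on the SFT figures (one epoch, 510,577 examples, 15,955 steps).
SFT_EXAMPLES = 510_577
SFT_STEPS = 15_955

examples_per_step = SFT_EXAMPLES / SFT_STEPS          # slightly above 32
inferred_batch = round(examples_per_step)             # assumption: integer batch size
steps_if_drop_last = SFT_EXAMPLES // inferred_batch   # partial final batch dropped

if __name__ == "__main__":
    print(f"examples per step: {examples_per_step:.3f}")
    print(f"inferred batch size: {inferred_batch}")
    print(f"steps with that batch size: {steps_if_drop_last:,}")
```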

Phase 3: DPO (Direct Preference Optimization)

Refine taste and style. Train the model to prefer interesting, thoughtful responses over generic safe ones. Preference pairs are constructed to reward curiosity and penalize hedging.
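For readers unfamiliar with DPO, the per-pair objective can be written in a few lines. This is the standard DPO loss applied to one preference pair, given summed log-probabilities of each response under the policy being trained and under a frozen reference model. The function name and the `beta=0.1` default are illustrative; the post does not specify the hyperparameters planned for this phase.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Inputs are sequence log-probabilities. `beta` is a hypothetical value.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(x)) = log(1 + exp(-x)), computed in a stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-15.0, -15.0, -15.0, -15.0))
    # Policy favours the chosen response more than the reference does.
    print(dpo_loss(-10.0, -20.0, -15.0, -15.0))
```

The loss falls below log(2) only when the policy raises the chosen response relative to the rejected one by more than the reference model does, which is how the preference pairs steer style.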

Phase 4: Continued pre-training cycles

Run SFT and DPO on the 40k base, then continue pre-training to 80,000 and beyond. Each cycle produces a better pre-trained foundation, which produces a better aligned model.

The ~60B-token corpus is larger than the ~42B tokens consumed through step 80,000, so the run has seen no repeated data: every token so far has been new.
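How far the run can go before data repeats follows from the corpus size and the effective batch. A short sketch of that arithmetic, treating the "~60B" corpus as exactly 60 billion tokens and assuming each token is drawn once, in order, without resampling:

```python
# Corpus coverage as a function of training step.
CORPUS_TOKENS = 60_000_000_000   # "~60B" in the post; exact size is an assumption
TOKENS_PER_STEP = 524_288        # effective batch from the training configuration


def corpus_fraction(step):
    """Fraction of the corpus consumed after `step` optimizer steps."""
    return step * TOKENS_PER_STEP / CORPUS_TOKENS


# Last step that can complete before the corpus is exhausted.
steps_per_epoch = CORPUS_TOKENS // TOKENS_PER_STEP

if __name__ == "__main__":
    print(f"corpus used at step 80,000: {corpus_fraction(80_000):.1%}")
    print(f"steps available before any repetition: {steps_per_epoch:,}")
```

Under these assumptions the 80,000-step run uses about 70% of the corpus, and repetition would begin a little past 114,000 steps.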

Run 1: What Happened (Historical)

📜 Run 1 History (steps 0-8,500, March 17-24)

Run 1 used a different architecture: 20 layers, dim 2048, 16 heads, batch size 1. It reached 6,500 tok/s and was on track to hit 20k steps in ~13 days. Two critical failures occurred:

1. FSDP Checkpoint Deadlock

Checkpoint saves hung indefinitely due to NCCL ALLGATHER over PCIe without NVLink. Fixed by switching to DCP sharded checkpoints.

2. Optimizer State Bug (Silent)

The DCP resume path loaded only the model weights, not the AdamW optimizer state. This produced a false recovery: loss looked healthy for ~50 steps, then diverged. The fix: load optimizer state alongside model weights, with a try/except fallback.
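The control flow of that fix can be sketched without any distributed machinery. The example below uses plain dictionaries as stand-ins for the model state, the optimizer state, and the loaded checkpoint; the function name, the `"model"` / `"optimizer"` keys, and the fallback behaviour (warn and continue with fresh optimizer state) are all assumptions for illustration, not the code in pretrainv3.py or a real DCP API.

```python
import logging

log = logging.getLogger("resume")


def restore_training_state(checkpoint, model_state, optim_state):
    """Copy weights and optimizer state from `checkpoint` into the live dicts.

    Returns True if optimizer state was restored, False if the fallback ran.
    All names here are hypothetical stand-ins for the real resume path.
    """
    # Weights are mandatory: a missing "model" entry raises instead of being masked.
    model_state.update(checkpoint["model"])
    try:
        saved_optim = checkpoint["optimizer"]
        optim_state.clear()
        optim_state.update(saved_optim)
        return True
    except KeyError:
        # Make the degraded resume loud so it cannot pass as a healthy one.
        log.warning("checkpoint has no optimizer state; AdamW moments restart from zero")
        return False


if __name__ == "__main__":
    ckpt = {"model": {"w": 1.0}, "optimizer": {"step": 8_500, "exp_avg": {"w": 0.1}}}
    model, optim = {"w": 0.0}, {}
    print(restore_training_state(ckpt, model, optim), model, optim)
```

Returning a flag and logging a warning keeps the fallback visible, which addresses the "silent" part of the original bug: a weights-only resume is still possible, but it can no longer happen unnoticed.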

These failures led to the Run 2 redesign. See the full postmortems: FSDP Deadlock · Optimizer State Bug

Run 1 Loss Data

Step | Loss | Step | Loss
0 | 11.17 | 3,400 | 2.73
200 | 4.87 | 3,600 | 2.42
400 | 4.34 | 3,800 | 2.45
600 | 3.55 | 4,000 | 2.25
800 | 3.03 | 4,200 | 2.35
1,000 | 3.27 | 4,400 | 2.19
1,200 | 3.02 | 4,600 | 2.46
1,400 | 3.02 | 4,800 | 2.10
1,600 | 2.94 | 5,000 | 2.39
1,800 | 2.74 | 5,500 | 2.26
2,000 | 2.54 | 6,000 | 2.20
2,200 | 2.36 | 6,500 | 2.15
2,400 | 2.44 | 7,000 | 1.90
2,600 | 2.54 | 7,500 | 1.69
2,800 | 2.62 | 8,000 | 1.53
3,000 | 2.68 | 8,500 | 1.42

Try It Yourself

The model is ready to inspect. Select a checkpoint and generate text to see how it evolved across the run:

Powered by HuggingFace ZeroGPU, free inference on NVIDIA H200