
Genesis 1B: Training Progress, Live Results

Author: Robin, Kroonen AI Inc.

Tags: Genesis 1B · pretraining · rtx-4090 · training live

⚡ Update, March 23, 2026 - DCP Resume Bug & Script Rewrite

After resuming from step 8,500, training diverged: loss climbing instead of falling, gradient norms rising. Root cause: the DCP checkpoint resume was missing ShardedStateDictConfig(offload_to_cpu=True), which caused corrupt weight mapping on PCIe topology (no NVLink). The script was rewritten with the correct resume logic, activation checkpointing (reducing VRAM from ~24GB to ~16.5GB), and cleaned up. Optimizer state was reset during the fix. AdamW is rebuilding momentum from scratch. Loss is currently ~2.3-2.5 and recovering. ETA to step 20,000: ~5 days.

Model: Genesis 1B

| Parameter | Value |
|---|---|
| Parameters | 1,003M (1.0B) |
| Architecture | Llama-style decoder-only transformer |
| Hidden dim | 2048 |
| Layers | 20 |
| Attention heads | 16 (4 KV heads, GQA) |
| FFN dim | 5632 (SwiGLU) |
| Context length | 2048 |
| Vocab size | 49,152 |
| Precision | bfloat16 |
| Positional encoding | RoPE (θ=500,000) |
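The θ=500,000 RoPE base translates directly into per-head rotation frequencies. Below is a minimal sketch of the textbook RoPE frequency computation using the dimensions above (this is the standard formulation, not Genesis's actual code):

```python
# Standard RoPE inverse frequencies (sketch only, not Genesis's code).
# head_dim = hidden_dim / n_heads = 2048 / 16 = 128.
theta = 500_000.0
head_dim = 2048 // 16
inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
# A token at position p rotates the i-th 2D subspace of each head by
# angle p * inv_freq[i]; a larger theta stretches the longest wavelength,
# which helps positional resolution at long range.
```

With θ=500,000 the slowest frequency completes a full rotation only after hundreds of thousands of positions, far beyond the 2048-token context.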

Training Configuration

| Setting | Value |
|---|---|
| GPUs | 2× RTX 4090 (PCIe, no NVLink) |
| Batch size | 1 per GPU |
| Gradient accumulation | 64 steps |
| Effective batch | 262,144 tokens/step |
| Learning rate | 3e-4 → 3e-5 (cosine decay) |
| Warmup | 500 steps |
| Optimizer | AdamW (β1=0.9, β2=0.95, wd=0.1) |
| Activation checkpointing | Enabled (per TransformerBlock), added March 23 |
| DCP resume | ShardedStateDictConfig(offload_to_cpu=True), required for PCIe topology |
| CUDA allocator | expandable_segments:True |
| VRAM per GPU | ~16.5 GB (down from ~24 GB before checkpointing) |
| Throughput | ~6,500 tok/s |
| Target | 5.2B tokens (20,000 steps) |
| Estimated time | ~10 days |
| NCCL | NCCL_P2P_DISABLE=1 |
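The batch and schedule numbers above are internally consistent; the following is a hypothetical sketch of the stated warmup-plus-cosine policy (not the actual training script, and `lr_at` is an illustrative helper name):

```python
import math

# Effective batch: 1 seq/GPU x 2 GPUs x 64 accumulation steps x 2048 tokens.
tokens_per_step = 1 * 2 * 64 * 2048  # 262,144 tokens/step

MAX_LR, MIN_LR = 3e-4, 3e-5
WARMUP, TOTAL_STEPS = 500, 20_000

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

At 262,144 tokens/step, the 20,000-step target works out to ~5.24B tokens, matching the table's 5.2B figure.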

Smoke Test Results

Before committing to a multi-day run, the pipeline was tested methodically:

  1. Training only (no eval, no checkpoint): Verified training loop stability over 100+ steps. ✅
  2. Training + DCP checkpoint save: Ran 220 steps with --save-every 150. Sharded checkpoint saved at step 150 without deadlock. ✅
  3. Resume from checkpoint: Restarted with --resume, loaded DCP sharded state, continued training from step 150 to 300. Loss consistent with pre-save values. ✅
  4. Second checkpoint save: Step 300 save completed cleanly, overwriting the previous checkpoint. ✅

Training Progress: Live Results

The model has now trained well past the initial smoke test. Here is the full loss journey from step 0 to the current checkpoint:

| Step | Loss | Step | Loss |
|---:|---:|---:|---:|
| 0 | 11.17 | 3,400 | 2.73 |
| 200 | 4.87 | 3,600 | 2.42 |
| 400 | 4.34 | 3,800 | 2.45 |
| 600 | 3.55 | 4,000 | 2.25 |
| 800 | 3.03 | 4,200 | 2.35 |
| 1,000 | 3.27 | 4,400 | 2.19 |
| 1,200 | 3.02 | 4,600 | 2.46 |
| 1,400 | 3.02 | 4,800 | 2.10 |
| 1,600 | 2.94 | 5,000 | 2.39 |
| 1,800 | 2.74 | 5,500 | 2.26 |
| 2,000 | 2.54 | 6,000 | 2.20 |
| 2,200 | 2.36 | 6,500 | 2.15 |
| 2,400 | 2.44 | 7,000 | 1.90 |
| 2,600 | 2.54 | 7,500 | 1.69 |
| 2,800 | 2.62 | 8,000 | 1.53 |
| 3,000 | 2.68 | 8,500 | 1.42 |
| 3,200 | 2.48 | | |

⚠️ DCP resume bug: divergence after the step-8,500 resume, followed by the script rewrite. Corrupt weight mapping on resume fixed, activation checkpointing added, optimizer state reset. Loss recovering:

| Step | Loss | Step | Loss |
|---:|---:|---:|---:|
| 8,520 | 2.53 | 8,545 | 2.09 |
| 8,550 | 1.92 | training live, recovering ↓ | |

Loss dropped from 11.17 to 1.42 over the first 8,500 steps. After resuming from the step 8,500 checkpoint, training diverged: loss climbing, gradient norms rising. Root cause: the DCP (Distributed Checkpoint) resume path was missing ShardedStateDictConfig(offload_to_cpu=True), which caused corrupt shard-to-rank weight mapping on PCIe topology without NVLink. The model appeared to load but the weights were incorrectly reassembled, poisoning training from step 1 of the resume.

The fix: ShardedStateDictConfig(offload_to_cpu=True) forces the DCP loader to reassemble shards on system RAM first, then distribute cleanly to each GPU. This is required when training on PCIe topology without NVLink. Without it, the shard-to-rank mapping silently corrupts on resume. If you are training 1B+ models on consumer GPUs (RTX 4090, 3090, etc.) with FSDP over PCIe, always use CPU offloading for DCP resume.
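In code, the corrected resume path looks roughly like the configuration sketch below, using PyTorch's FSDP and distributed-checkpoint APIs (PyTorch ≥ 2.2 assumed; the checkpoint path is illustrative, and `model` is assumed to be FSDP-wrapped with the process group already initialized):

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemReader
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardedStateDictConfig, StateDictType

# The critical setting: offload_to_cpu=True makes the DCP loader reassemble
# shards in system RAM before distributing them to each rank. Without it,
# the shard-to-rank mapping can silently corrupt over PCIe (no NVLink).
FSDP.set_state_dict_type(
    model,
    StateDictType.SHARDED_STATE_DICT,
    state_dict_config=ShardedStateDictConfig(offload_to_cpu=True),
)

state = {"model": model.state_dict()}
dcp.load(state, storage_reader=FileSystemReader("checkpoints/step_8500"))
model.load_state_dict(state["model"])
```

Note that only the model weights are restored here; per the article, the optimizer state was reset during the fix rather than reloaded.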

The script was also rewritten with activation checkpointing (reducing VRAM from ~24GB to ~16.5GB per GPU). The optimizer state was reset during the fix. The recovery curve confirms the fix worked: loss went from ~2.8 at the first post-fix step down to ~2.1 within 50 steps, with step 8,550 already hitting 1.92. AdamW is rebuilding its second moment estimates and the trajectory is firmly downward.
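The per-TransformerBlock activation checkpointing can be wired up with PyTorch's wrapper utility. A hedged sketch (the `TransformerBlock` class name follows the article's wording; the surrounding training setup is assumed):

```python
from functools import partial

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

# Wrap every TransformerBlock so its activations are recomputed during the
# backward pass instead of stored, trading extra compute for the
# ~24 GB -> ~16.5 GB per-GPU VRAM reduction reported above.
wrapper = partial(checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT)
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=wrapper,
    check_fn=lambda module: isinstance(module, TransformerBlock),
)
```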

The model showed emergent turn-taking structure in raw completions before the crash, prior to any instruction tuning or alignment. Each step processes 262,144 tokens, so the model has seen approximately 2.2B tokens of the 60B-token corpus, less than 4% of the available data. That means zero repetition and no overfitting risk at this stage.
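The token accounting is a quick back-of-envelope check:

```python
# Tokens seen so far at the current checkpoint (step 8,550).
tokens_per_step = 262_144
steps_so_far = 8_550
tokens_seen = steps_so_far * tokens_per_step  # 2,241,331,200 (~2.24B)
fraction = tokens_seen / 60e9                 # share of the 60B-token corpus
```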

The live tracker on the homepage pulls from latest.json and updates automatically as new checkpoints are saved. All checkpoints are archived as they are written.

Early Loss Curve

| Step | Loss | tok/s |
|---:|---:|---:|
| 0 | 11.17 | 65,134 |
| 10 | 9.03 | 6,434 |
| 20 | 7.62 | 6,439 |
| 30 | 7.07 | 6,444 |
| 150 | 6.03 | 6,209 |
| 290 | 5.27 | 6,157 |

Loss dropping steadily from 11.17 to 5.27 over 300 steps. Both GPUs at 100% utilization, temps under 50°C. (VRAM was ~21 GB before the activation checkpointing fix; now ~16.5 GB per GPU.)

The Dataset

The corpus is ~60B tokens, curated from public sources, all tokenized with a custom SentencePiece BPE tokenizer trained on the corpus itself.

The Road to Genesis 1B v0.1

Pre-training is only the first phase. The full pipeline has four stages, each producing a progressively better model:

Phase 1: Pre-training (current, ~43% complete at step 8,550 of 20,000)

Complete 20,000 steps, consuming approximately 5.2B tokens. This produces genesis-1b-v0.1-base: the raw pre-trained foundation. No instruction following, no alignment, no personality yet. Just a model that has learned the structure of language from a diverse corpus.

Phase 2: SFT (Supervised Fine-Tuning)

Teach the model conversational ability, personality, and curiosity using curated dialogue data. The approach is inspired by Anthropic's Constitutional AI: define a set of principles (be helpful, be curious, be honest, don't be boring) and train the model to follow them. This is where Genesis diverges from the standard safety-first fine-tuning pipeline. The goal is a model with genuine personality, not a model optimized for refusal rates.

Phase 3: DPO (Direct Preference Optimization)

Refine taste and style. Train the model to prefer interesting, thoughtful responses over generic safe ones. Preference pairs are constructed to reward curiosity and penalize hedging. This is what separates a model worth talking to from a model that merely answers questions.
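The DPO objective is simple enough to state inline. A scalar sketch of the standard DPO loss (not Genesis's actual training code; β=0.1 and the helper name are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probs
    under the policy (pi_*) and the frozen reference model (ref_*)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)): small when the policy prefers the
    # chosen ("interesting") response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At zero margin the loss is ln 2 ≈ 0.693, and it falls as the policy's preference for the chosen response grows relative to the reference model's.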

Phase 4: Continued pre-training cycles

Continue pre-training to 40,000 steps (~10.5B tokens), then run SFT and DPO again from the stronger base. Repeat at 60,000 and 80,000+ steps. Each cycle produces a better pre-trained foundation, which produces a better aligned model. The structure is a tree: the pre-training trunk keeps growing, and SFT/DPO branches off at each milestone checkpoint.

At ~76,300 steps the model hits the Chinchilla-optimal compute allocation for a 1B-parameter model (~20B tokens seen). The 60B-token corpus means zero data repetition through roughly 228,000 steps. Every token the model sees during the extended runs is genuinely new data.
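The milestone arithmetic, for the skeptical reader:

```python
tokens_per_step = 262_144
chinchilla_tokens = 20e9  # ~20 tokens per parameter for a 1B model
chinchilla_steps = chinchilla_tokens / tokens_per_step  # ~76,294 (the "~76,300")
corpus_steps = 60e9 / tokens_per_step                   # ~228,882 steps of fresh data
```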

Try It Yourself

The model is training live. Select a checkpoint and generate text to see how it evolves over time:

Powered by HuggingFace ZeroGPU, free inference on NVIDIA H200

Contact

If you are a founder, independent researcher, or small lab working on multi-GPU local training and have encountered similar checkpoint or synchronization failures on consumer hardware, reach out at [email protected].