From Checkpoint Deadlocks to Live Pretraining: Training a 1B Model on 2× RTX 4090
Author: Robin — Kroonen AI Inc.
⚡ Update — March 2026
The original full-state checkpoint deadlock has been resolved, and a 1B parameter model is now training from scratch on the same 2× RTX 4090 workstation. The key change was replacing FSDP full-state checkpoint gathering with DCP sharded saves, alongside no_sync() during gradient accumulation. That removed the checkpoint-time hang on this PCIe-only consumer topology. The one piece that remains unstable is distributed evaluation under FSDP, which can still trigger collective timeouts on this setup. In short: training is live, the checkpoint deadlock is fixed, and distributed eval is the last rough edge.
I'm training my own AI model from scratch on a gaming PC in my apartment. No cloud, no funding, no cluster. Just two graphics cards and stubbornness. It kept crashing — not during training, but every time it tried to save progress. This is what broke, why, and how I fixed it. The technical details follow for engineers; the short version is: it's working now.
Summary
Over the past several days, I have been building and testing a full local pretraining pipeline for a language model from scratch, including:
- Custom SentencePiece tokenizer (49,152 vocab)
- Curated ~60B token multilingual corpus
- Distributed pretraining stack (FSDP on PyTorch 2.9)
- Evaluation and checkpointing pipeline
- 26 tracked experiment runs with crash documentation
The core training pipeline always worked. Forward pass, backward pass, gradient accumulation, loss going down — all fine.
The blocker was distributed checkpointing. Every run crashed at checkpoint boundaries — not during training, but during the save operation itself.
This article documents the problem, the root cause, the fix, and the live training run that proves it works.
The Problem
Hardware
- GPUs: 2× NVIDIA RTX 4090 (24 GB each)
- CPU: AMD Ryzen 9 7950X3D
- Topology: PCIe-only (PHB) — no NVLink
- OS: Pop!_OS (Linux)
- PyTorch: 2.9.1+cu128
What Happened
Training would run for hundreds of steps with healthy loss curves and stable throughput. Then, at the first checkpoint boundary, the process would hang indefinitely.
The failure was 100% reproducible. Every single run crashed at the same point: the checkpoint save operation.
Crash points across 26 tracked runs:
- Step 250 — hang
- Step 499 — hang
- Step 999 — hang
Never during forward pass. Never during backward pass. Always during checkpoint save.
Root Cause
The standard FSDP checkpoint approach uses FullStateDictConfig to gather the complete model state onto rank 0:
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
    model_state = model.state_dict()
    optim_state = FSDP.optim_state_dict(model, optimizer)

This triggers an ALLGATHER operation across all GPUs via NCCL. On datacenter hardware with NVLink (600+ GB/s of bidirectional bandwidth), this completes in seconds.
On PCIe-connected consumer GPUs, this ALLGATHER becomes a bottleneck. With both GPUs already near memory capacity from training, the gather operation requires materializing the full model state on rank 0 while rank 1 waits. The NCCL timeout fires. The process deadlocks.
The same issue affected the evaluation path, which also performed a full-state gather to save a temporary checkpoint for an async eval subprocess.
The system could train indefinitely. It could not save.
The Fix
These changes turned a reliably crashing checkpoint path into a working local training pipeline. The 1B model now runs past the point where every earlier run failed, with sharded checkpointing replacing the previous full-gather deadlock. The remaining rough edge is distributed evaluation under FULL_SHARD on consumer PCIe hardware, which is being debugged separately from the training and checkpoint path.
Three key changes:
1. DCP Sharded Checkpoints
Replace the full-state gather with PyTorch's Distributed Checkpoint (DCP). Each rank saves its own shard independently. No ALLGATHER. No NCCL coordination during save.
import torch.distributed.checkpoint as dcp

with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {
        "model": model.state_dict(),
        "optimizer": FSDP.optim_state_dict(model, optimizer),
    }
dcp.save(state_dict, checkpoint_id=checkpoint_dir)

Resume works the same way — each rank loads its own shard:
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {
        "model": model.state_dict(),
        "optimizer": FSDP.optim_state_dict(model, optimizer),
    }
dcp.load(state_dict, checkpoint_id=checkpoint_dir)
model.load_state_dict(state_dict["model"])
# restore the optimizer state as well, re-sharded for this rank
optimizer.load_state_dict(
    FSDP.optim_state_dict_to_load(model, optimizer, state_dict["optimizer"])
)

2. Gradient Accumulation with no_sync
FSDP synchronizes gradients on every backward() call by default. With gradient accumulation (64 microsteps in our case), that means 63 unnecessary NCCL communications per step.
from contextlib import nullcontext

for micro_step in range(grad_accum):
    # skip gradient sync on every microstep except the last
    ctx = model.no_sync() if micro_step < grad_accum - 1 else nullcontext()
    with ctx:
        loss = model(x, y) / grad_accum
        loss.backward()

Only the final microstep synchronizes. This is standard practice but easy to miss.
3. Evaluation Strategy
The original pipeline spawned a separate Python process for evaluation, which required saving a full-state checkpoint first — triggering the same deadlock.
The first fix was lightweight in-process validation on all ranks with dist.all_reduce to aggregate loss. This removed the subprocess and full-state gather, but inline eval under FULL_SHARD still caused rank desynchronization on this PCIe topology — the FSDP module-gather order during model.eval() diverged across ranks.
The practical solution: disable inline eval during training and evaluate from saved checkpoints in a separate single-GPU script. This decouples evaluation from the training loop entirely, eliminating any risk of eval crashing a multi-day run.
The Result
Model: Genesis 1B
| Parameters | 1,003M (1.0B) |
| Architecture | Llama-style decoder-only transformer |
| Hidden dim | 2048 |
| Layers | 20 |
| Attention heads | 16 (4 KV heads, GQA) |
| FFN dim | 5632 (SwiGLU) |
| Context length | 2048 |
| Vocab size | 49,152 |
| Precision | bfloat16 |
| Positional encoding | RoPE (θ=500,000) |
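As a sanity check, the parameter count in the table can be reproduced from the architecture numbers alone. The sketch below is illustrative (the field names are mine, not the training code's) and assumes tied input/output embeddings, which is what makes the total land at roughly 1.0B:

```python
from dataclasses import dataclass

@dataclass
class GenesisConfig:
    # values from the table above; field names are illustrative
    hidden: int = 2048
    layers: int = 20
    heads: int = 16
    kv_heads: int = 4
    ffn: int = 5632
    vocab: int = 49152

    def param_count(self) -> int:
        head_dim = self.hidden // self.heads      # 128
        kv_dim = self.kv_heads * head_dim         # 512 under GQA
        attn = 2 * self.hidden * self.hidden      # Q and O projections
        attn += 2 * self.hidden * kv_dim          # K and V projections
        ffn = 3 * self.hidden * self.ffn          # gate, up, down (SwiGLU)
        norms = 2 * self.hidden                   # two RMSNorms per layer
        per_layer = attn + ffn + norms
        embed = self.vocab * self.hidden          # tied with the LM head
        return self.layers * per_layer + embed + self.hidden  # + final norm

print(f"{GenesisConfig().param_count() / 1e6:.0f}M")  # 1003M
```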
Training Configuration
| GPUs | 2× RTX 4090 (PCIe, no NVLink) |
| Batch size | 1 per GPU |
| Gradient accumulation | 64 steps |
| Effective batch | 262,144 tokens/step |
| Learning rate | 3e-4 → 3e-5 (cosine decay) |
| Warmup | 500 steps |
| Optimizer | AdamW (β1=0.9, β2=0.95, wd=0.1) |
| Throughput | ~6,400 tok/s |
| Target | 5.2B tokens (20,000 steps) |
| Estimated time | ~10 days |
| NCCL | NCCL_P2P_DISABLE=1 |
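The headline numbers in this table are internally consistent, which is worth verifying before committing to a ten-day run. A quick back-of-the-envelope check, using only figures from the table:

```python
# Tokens per optimizer step: per-GPU batch x GPUs x grad-accum x context length
tokens_per_step = 1 * 2 * 64 * 2048
print(tokens_per_step)             # 262144, matches "Effective batch"

# Total tokens over the full run
total_tokens = 20_000 * tokens_per_step
print(total_tokens / 1e9)          # 5.24288, matches the 5.2B target

# Wall-clock estimate at the measured ~6,400 tok/s
days = total_tokens / 6_400 / 86_400
print(round(days, 1))              # 9.5, matches the ~10 day estimate
```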
Smoke Test Results
Before committing to a multi-day run, the pipeline was tested methodically:
- Training only (no eval, no checkpoint): Verified training loop stability over 100+ steps. ✅
- Training + DCP checkpoint save: Ran 220 steps with --save-every 150. Sharded checkpoint saved at step 150 without deadlock. ✅
- Resume from checkpoint: Restarted with --resume, loaded DCP sharded state, continued training from step 150 to 300. Loss consistent with pre-save values. ✅
- Second checkpoint save: Step 300 save completed cleanly, overwriting the previous checkpoint. ✅
Early Loss Curve
| Step | Loss | tok/s |
|---|---|---|
| 0 | 11.17 | 65,134 |
| 10 | 9.03 | 6,434 |
| 20 | 7.62 | 6,439 |
| 30 | 7.07 | 6,444 |
| 150 | 6.03 | 6,209 |
| 290 | 5.27 | 6,157 |
Loss dropping steadily from 11.17 to 5.27 over 300 steps. Both GPUs at 100% utilization, ~21 GB VRAM used each, temps under 50°C.
The Dataset
~60B tokens, curated from public sources:
- FineWeb-Edu (English web, educational filter)
- DCLM baseline + extra slices
- StarCoderData (code)
- FineMath (mathematics)
- Wikipedia (multilingual)
- CulturaX (Arabic, German, Spanish, French, Japanese, Korean, Portuguese, Chinese)
- OpenHermes, Orca AgentInstruct (instruction data)
- Function calling datasets (Glaive, Gorilla, Hermes, xLAM)
- Cosmopedia (synthetic textbooks)
All tokenized with a custom SentencePiece BPE tokenizer trained on the corpus itself.
Known Limitation: Inline Distributed Evaluation
One issue remains unsolved in the current pipeline: inline evaluation under FSDP FULL_SHARD causes rank desynchronization on this PCIe topology.
When only rank 0 enters the eval code path (a common pattern), the two ranks diverge in their FSDP all-gather sequence. Rank 0 begins gathering parameters for the eval forward pass while rank 1 expects the next training step's gather. The sequence numbers drift apart, and NCCL times out.
Making both ranks participate in eval (with dist.all_reduce to aggregate val loss) was the right approach, but the ranks still desynchronized — likely because the FSDP module-gather order during model.eval() differs subtly from training mode on PCIe.
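For reference, the "both ranks participate" pattern looks like the sketch below (the function name is mine, not the project's eval code). The key property is that every rank runs the same forward passes and hits the same collective in the same order; on this topology that turned out to be necessary but evidently not sufficient:

```python
import torch
import torch.distributed as dist

def all_rank_val_loss(local_loss: torch.Tensor) -> torch.Tensor:
    """Average a per-rank validation loss across all ranks.
    Every rank MUST call this: a rank-0-only eval path leaves the
    other rank waiting in FSDP's next all-gather until NCCL times out."""
    loss = local_loss.detach().clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    return loss / dist.get_world_size()
```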
Current workaround: Evaluate from saved checkpoints in a separate single-GPU script. This is actually cleaner — it decouples evaluation from training and avoids any risk of eval crashing a multi-day run.
Alternative to explore: Switching from FULL_SHARD to SHARD_GRAD_OP reduces the per-forward all-gather frequency and may make inline eval viable. This has not been tested yet.
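For reference, that untested alternative amounts to a one-argument change where the model is wrapped. This is a fragment, not a runnable script, and every other FSDP argument is elided:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# SHARD_GRAD_OP keeps parameters gathered between forward and backward,
# trading memory for fewer all-gathers. Untested on this setup.
model = FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
```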
What This Means for Small Labs
The original version of this article concluded that a dual-4090 workstation might not be viable for serious pretraining. That conclusion was premature.
The hardware was always capable. The software path was wrong.
If you are hitting NCCL deadlocks during FSDP checkpointing on consumer GPUs:
- Stop using FULL_STATE_DICT during training
- Switch to torch.distributed.checkpoint (DCP) with SHARDED_STATE_DICT
- Use model.no_sync() for all gradient accumulation microsteps except the last
- Do not run evaluation that requires full-state gathering
- Set NCCL_P2P_DISABLE=1 on PCIe topologies
- Disable torch.compile until training is proven stable
- If you need a single consolidated checkpoint for inference, consolidate offline after training, not during
These are not exotic workarounds. They are the intended use of PyTorch's distributed checkpoint API for exactly this topology.
The Bigger Point
There is a narrative that local AI training on consumer hardware is either trivially possible or fundamentally limited. Neither is true.
The reality is more specific: the training loop works fine. The distributed systems layer is where consumer hardware diverges from datacenter hardware. If you understand that divergence and adjust your checkpoint and communication strategy accordingly, consumer GPUs become viable training infrastructure.
I am one person, training a 1B parameter model from scratch, on a workstation in my apartment, with no funding and no cluster. The model is learning right now.
That should not be remarkable in 2026. But it is, because the documentation and tooling still assume datacenter hardware by default. This post is an attempt to bridge that gap.
Software Stack
- PyTorch: 2.9.1+cu128
- CUDA: 13.0
- OS: Pop!_OS (Linux, kernel 6.x)
- GPU Driver: NVIDIA 580.126
- Distributed: FSDP (FULL_SHARD) + DCP
- Tracking: Weights & Biases
- Tokenizer: SentencePiece BPE (49,152 vocab)
What's Next
- Complete the 5.2B token run (~10 days) — currently in progress
- Build standalone eval script for checkpoint evaluation (HellaSwag, ARC-Easy, PIQA)
- Release Genesis 1B v0.1 on HuggingFace
- Experiment with data mix adjustments at intermediate checkpoints
- Continue training in increments — release improved versions as the model sees more data
- Explore MoE upcycling from dense checkpoints
- Optimize throughput: gradient checkpointing + torch.compile for batch_size=2
- Scale to 3B on NVLink-equipped cloud instances for comparison
Try It Yourself
The model is training live. Select a checkpoint and generate text to see how it evolves over time:
Powered by HuggingFace ZeroGPU — free inference on NVIDIA H200
Contact
If you are a founder, independent researcher, or small lab working on multi-GPU local training and have encountered similar checkpoint or synchronization failures on consumer hardware, reach out at [email protected].