This repo contains some examples of distributed PyTorch training in multi-GPU, multi-node SLURM clusters. The examples are typically used to test SLURM clusters created on Crusoe Cloud.
Clone this repo and cd into the distributed-torchrun-on-slurm directory. Install the required packages in a virtual environment accessible to all hosts. In the Crusoe SLURM solution, /home of all cluster hosts is mounted to a shared volume, so by creating the venv on the login node it's available on every node. The .slurm script in each example refers to the venv location (in the venv activation step) so if you create the venv somewhere other than ~/distributed-torchrun-on-slurm, make sure to update the path in the .slurm scripts.
git clone https://github.com/crusoecloud/distributed-torchrun-on-slurm
cd distributed-torchrun-on-slurm
#install the right python venv for your python version
sudo apt install -y python3.10-venv
#create and activate the venv
python3 -m venv .venv
source .venv/bin/activate
#install torch packages direct from PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 #or whichever version of cuda is suitable for the nvidia driver on your ci
#install other packages
pip install -r requirements.txtcd into the directory for your chosen example e.g basic distributed training or the vision-transformer-training Each directory contains Python training programs (train_*.py) and a corresponding sbatch (run_*.slurm) script for each. Edit the sbatch script for the program you want to run to set the number of nodes, .venv location and any other options necessary for your environment. Then submit the job and monitor it with squeue, log tailing etc. The basic measure of success is that each job should iterate through epochs and generate checkpoints.
-
Environment activation (lines 24-27):
source ~/distributed-torchrun-on-slurm/venv/bin/activate
-
SLURM parameters (lines 2-10):
--nodes=4- Number of nodes (adjust as needed)--gpus-per-node=8- GPUs per node (adjust as needed)--cpus-per-task=32- CPU cores (adjust based on your system)--time=02:00:00- Time limit
-
Network interface (line 31):
export NCCL_SOCKET_IFNAME=^docker,lo # Adjust for your network setup
Common interfaces:
eth0,ib0(InfiniBand),enp0s31f6, etc.
# Make sure the logs directory exists
mkdir -p logs
# Submit the job
sbatch run_distributed.slurm# Check job status
squeue -u $USER
# View live output
tail -f logs/slurm-<job_id>.out
# View live errors
tail -f logs/slurm-<job_id>.errscancel <job_id>The training script uses PyTorch's Distributed Data Parallel (DDP):
-
torchrun launches one process per GPU (16 total processes)
-
Each process gets environment variables:
RANK- Global rank (0-15)LOCAL_RANK- Local rank on the node (0-7)WORLD_SIZE- Total number of processes (16)
-
NCCL handles inter-GPU communication for:
- Gradient synchronization
- Metric aggregation
- Barrier synchronization
- DistributedSampler splits the dataset across all GPUs
- Each GPU processes a unique subset of the data
- Gradients are averaged across all GPUs after each backward pass
From train_distributed.py:
setup_distributed()- Initializes the process groupDDP(model)- Wraps the model for distributed trainingDistributedSampler- Distributes data across GPUsdist.all_reduce()- Aggregates metrics across all processesdist.barrier()- Synchronizes all processes
Edit in run_distributed.slurm (lines 42-46):
EPOCHS=10
BATCH_SIZE=64 # Per-GPU batch size
LEARNING_RATE=0.01
DATA_DIR="./data"
CHECKPOINT_DIR="./checkpoints"Edit the SimpleConvNet class in train_distributed.py to use your own model.
Replace the MNIST dataset in get_dataloader() with your own dataset:
train_dataset = datasets.ImageFolder(
root=args.data_dir,
transform=transform
)-
NCCL Timeout or Initialization Failure
- Check network connectivity between nodes
- Verify firewall allows communication on MASTER_PORT
- Set
export NCCL_DEBUG=INFOfor detailed logs
-
Out of Memory (OOM)
- Reduce
--batch-size - Reduce model size
- Enable gradient checkpointing
- Reduce
-
Slow Training
- Check if InfiniBand is enabled (
NCCL_IB_DISABLE=0) - Verify GPUDirect RDMA is available (
NCCL_NET_GDR_LEVEL=2) - Increase
num_workersin DataLoader
- Check if InfiniBand is enabled (
-
Hanging at Initialization
- Ensure all nodes can reach the master node
- Check
NCCL_SOCKET_IFNAMEis set correctly - Verify SLURM environment variables are set
Enable verbose logging:
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL- Optimize batch size: Use the largest batch size that fits in GPU memory
- Pin memory: Already enabled in DataLoader (
pin_memory=True) - Multiple workers: Adjust
num_workersin DataLoader based on CPU count - Mixed precision: Add automatic mixed precision (AMP) for faster training:
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
with autocast():
output = model(data)
loss = criterion(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()To scale to different configurations, only change the SLURM parameters:
#SBATCH --nodes=4 # 4 nodes
#SBATCH --gpus-per-node=8 # 8 GPUs per node = 32 total GPUsThe training script automatically adapts to the number of available GPUs.
Checkpoints are saved after each epoch to ./checkpoints/:
checkpoints/
checkpoint_epoch_1.pt
checkpoint_epoch_2.pt
...Only rank 0 saves checkpoints to avoid conflicts.
Check the output logs for:
Starting distributed training on 16 GPUs
Master node: node001
...
Epoch 1 - Avg Loss: 0.234567, Accuracy: 92.34%
The number of GPUs should match: nodes * gpus-per-node = total GPUs