PyTorch Distributed Training on SLURM

This repo contains some examples of distributed PyTorch training in multi-GPU, multi-node SLURM clusters. The examples are typically used to test SLURM clusters created on Crusoe Cloud.

Pre-requisites

Clone this repo and cd into the distributed-torchrun-on-slurm directory. Install the required packages in a virtual environment accessible to all hosts. In the Crusoe SLURM solution, /home of all cluster hosts is mounted to a shared volume, so by creating the venv on the login node it's available on every node. The .slurm script in each example refers to the venv location (in the venv activation step) so if you create the venv somewhere other than ~/distributed-torchrun-on-slurm, make sure to update the path in the .slurm scripts.

git clone https://github.com/crusoecloud/distributed-torchrun-on-slurm
cd distributed-torchrun-on-slurm
#install the right python venv for your python version
sudo apt install -y python3.10-venv
#create and activate the venv
python3 -m venv .venv
source .venv/bin/activate
#install torch packages direct from PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 #or whichever version of cuda is suitable for the nvidia driver on your ci
#install other packages
pip install -r requirements.txt

Configuration

SLURM Script Configuration

cd into the directory for your chosen example e.g basic distributed training or the vision-transformer-training Each directory contains Python training programs (train_*.py) and a corresponding sbatch (run_*.slurm) script for each. Edit the sbatch script for the program you want to run to set the number of nodes, .venv location and any other options necessary for your environment. Then submit the job and monitor it with squeue, log tailing etc. The basic measure of success is that each job should iterate through epochs and generate checkpoints.

Environment activation (lines 24-27):

source ~/distributed-torchrun-on-slurm/venv/bin/activate

SLURM parameters (lines 2-10):
- --nodes=4 - Number of nodes (adjust as needed)
- --gpus-per-node=8 - GPUs per node (adjust as needed)
- --cpus-per-task=32 - CPU cores (adjust based on your system)
- --time=02:00:00 - Time limit
Network interface (line 31):
```
export NCCL_SOCKET_IFNAME=^docker,lo  # Adjust for your network setup
```
Common interfaces: eth0, ib0 (InfiniBand), enp0s31f6, etc.

Usage

1. Submit the SLURM Job

# Make sure the logs directory exists
mkdir -p logs

# Submit the job
sbatch run_distributed.slurm

2. Monitor the Job

# Check job status
squeue -u $USER

# View live output
tail -f logs/slurm-<job_id>.out

# View live errors
tail -f logs/slurm-<job_id>.err

3. Cancel the Job (if needed)

scancel <job_id>

How It Works

Distributed Setup

The training script uses PyTorch's Distributed Data Parallel (DDP):

torchrun launches one process per GPU (16 total processes)
Each process gets environment variables:
- RANK - Global rank (0-15)
- LOCAL_RANK - Local rank on the node (0-7)
- WORLD_SIZE - Total number of processes (16)
NCCL handles inter-GPU communication for:
- Gradient synchronization
- Metric aggregation
- Barrier synchronization

Data Distribution

DistributedSampler splits the dataset across all GPUs
Each GPU processes a unique subset of the data
Gradients are averaged across all GPUs after each backward pass

Key Components

From train_distributed.py:

setup_distributed() - Initializes the process group
DDP(model) - Wraps the model for distributed training
DistributedSampler - Distributes data across GPUs
dist.all_reduce() - Aggregates metrics across all processes
dist.barrier() - Synchronizes all processes

Customization

Training Parameters

Edit in run_distributed.slurm (lines 42-46):

EPOCHS=10
BATCH_SIZE=64           # Per-GPU batch size
LEARNING_RATE=0.01
DATA_DIR="./data"
CHECKPOINT_DIR="./checkpoints"

Model Architecture

Edit the SimpleConvNet class in train_distributed.py to use your own model.

Dataset

Replace the MNIST dataset in get_dataloader() with your own dataset:

train_dataset = datasets.ImageFolder(
    root=args.data_dir,
    transform=transform
)

Troubleshooting

Common Issues

NCCL Timeout or Initialization Failure
- Check network connectivity between nodes
- Verify firewall allows communication on MASTER_PORT
- Set export NCCL_DEBUG=INFO for detailed logs
Out of Memory (OOM)
- Reduce --batch-size
- Reduce model size
- Enable gradient checkpointing
Slow Training
- Check if InfiniBand is enabled (NCCL_IB_DISABLE=0)
- Verify GPUDirect RDMA is available (NCCL_NET_GDR_LEVEL=2)
- Increase num_workers in DataLoader
Hanging at Initialization
- Ensure all nodes can reach the master node
- Check NCCL_SOCKET_IFNAME is set correctly
- Verify SLURM environment variables are set

Debugging

Enable verbose logging:

export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL

Performance Tips

Optimize batch size: Use the largest batch size that fits in GPU memory
Pin memory: Already enabled in DataLoader (pin_memory=True)
Multiple workers: Adjust num_workers in DataLoader based on CPU count
Mixed precision: Add automatic mixed precision (AMP) for faster training:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

with autocast():
    output = model(data)
    loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Scaling to More Nodes/GPUs

To scale to different configurations, only change the SLURM parameters:

#SBATCH --nodes=4                    # 4 nodes
#SBATCH --gpus-per-node=8            # 8 GPUs per node = 32 total GPUs

The training script automatically adapts to the number of available GPUs.

Checkpoints

Checkpoints are saved after each epoch to ./checkpoints/:

checkpoints/
  checkpoint_epoch_1.pt
  checkpoint_epoch_2.pt
  ...

Only rank 0 saves checkpoints to avoid conflicts.

Verifying Distributed Training

Check the output logs for:

Starting distributed training on 16 GPUs
Master node: node001
...
Epoch 1 - Avg Loss: 0.234567, Accuracy: 92.34%

The number of GPUs should match: nodes * gpus-per-node = total GPUs

Name		Name	Last commit message	Last commit date
Latest commit History 70 Commits
basic-distributed-training		basic-distributed-training
fsdp-training		fsdp-training
torchtitan		torchtitan
train-and-serve-qwen-vl		train-and-serve-qwen-vl
vision-transformer-training		vision-transformer-training
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

PyTorch Distributed Training on SLURM

Pre-requisites

Configuration

SLURM Script Configuration

Usage

1. Submit the SLURM Job

2. Monitor the Job

3. Cancel the Job (if needed)

How It Works

Distributed Setup

Data Distribution

Key Components

Customization

Training Parameters

Model Architecture

Dataset

Troubleshooting

Common Issues

Debugging

Performance Tips

Scaling to More Nodes/GPUs

Checkpoints

Verifying Distributed Training

Additional Resources

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

PyTorch Distributed Training on SLURM

Pre-requisites

Configuration

SLURM Script Configuration

Usage

1. Submit the SLURM Job

2. Monitor the Job

3. Cancel the Job (if needed)

How It Works

Distributed Setup

Data Distribution

Key Components

Customization

Training Parameters

Model Architecture

Dataset

Troubleshooting

Common Issues

Debugging

Performance Tips

Scaling to More Nodes/GPUs

Checkpoints

Verifying Distributed Training

Additional Resources

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages