Distributed training (multi-node) of a Transformer model
Updated Apr 10, 2024 - Python
Messaging and state layer for distributed serverless applications
Summary of call graphs and data structures of NVIDIA Collective Communication Library (NCCL)
Blink+: Increase GPU group bandwidth by utilizing cross-tenant NVLink.
Collectives library for UPC++
TileXR (eXtreme Rendezvous for Asynchronous Tile Communication) is a data-centric asynchronous communication runtime for Huawei Ascend NPUs. TileXR is a communication library designed to be AI-native.
Audit GPU cluster communication schedules from NCCL logs. Zero dependencies. CI-ready.
Interactive web visualization for understanding collective communication algorithms (as used in NCCL, RCCL, MPI). Learn how AllReduce, Broadcast, Reduce, AllGather and more work step by step.
Research prototype investigating adaptive collective communication optimization for MPI workloads using runtime performance feedback.
Modelling the latency of MPI collective operations: Broadcast and Reduce. UniTS, SDIC, 2023-2024
AllReduce/AllGather scaling in ASTRA-sim across torus vs switch topologies on the analytical + ns-3 backends — latency-bound vs bandwidth-bound over message size and node count. Reproducible Docker/Chakra harness + write-up.
A quick, simple test to benchmark your PyTorch + NCCL/NCCLX setup
A reduction algorithm for MPI using only peer-to-peer communication
This repository contains simple example programs for MPI_Bcast, MPI_Reduce, MPI_Scatter, and MPI_Gather. Download the repository and test them yourself.
HPC course practice assignments for parallel programming
Summary of call graphs and data structures of collective communication plugin in NVIDIA TensorRT-LLM
MPI laboratory project demonstrating collective communication primitives to perform distributed numerical computations on a vector. Implements broadcast, scatter, gather, reduce, and scan operations while managing vector segments across multiple processes (Introduction to Parallel Computing, UNIWA).
Develop high-performance parallel applications in C++ using the Partitioned Global Address Space model and asynchronous communication primitives.