Multi-Node Distributed Training

EFA networking + NCCL โ€” up to 64 GPUs across 8 nodes, one command

๐Ÿ–ฅ๏ธ 2ร—8 H100 Example

Node 0 (master)
H100 H100 H100 H100 H100 H100 H100 H100
NVLink within node
EFA
3200 Gbps
Node 1 (worker)
H100 H100 H100 H100 H100 H100 H100 H100
NVLink within node

๐Ÿ“Š NCCL AllReduce Bandwidth

Single-node NVLink (baseline) ~34 GB/s
Cross-node EFA (best, 2ร—8 GPUs) ~21 GB/s avg ยท 33.6 peak
Cross-node EFA (conservative) ~9.5 GB/s avg
Future w/ GPUDirect RDMA (estimated) ~300+ GB/s

๐Ÿ’ป One Command โ€” Automatic NCCL + SSH + Env Setup

$ gpu-dev reserve --gpu-type h100 --gpus 16 --distributed --hours 8

๐Ÿ“‹ Reservation mn-abc123 โ€” 2 nodes ร— 8 GPUs
๐Ÿ”— EFA networking configured ยท peer SSH ready
โœ… MASTER_ADDR, WORLD_SIZE, RANK, NCCL_* pre-configured

# Just run torchrun โ€” everything is set up
dev@node-0 $ torchrun --nproc_per_node 8 --nnodes 2 --node_rank 0 train.py