Multi-Node Training

🖥️ 2×8 H100 Example

Node 0 (master): 8× H100, NVLink within node

EFA between nodes: 3200 Gbps

Node 1 (worker): 8× H100, NVLink within node
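
To make the topology concrete, the sketch below maps a global rank to its node and reports which interconnect two ranks would share. It assumes ranks are numbered node by node (0-7 on node 0, 8-15 on node 1), which is the usual layout when each node launches 8 processes. `node_of` and `link_between` are illustrative names, not part of any tool mentioned here.

```python
GPUS_PER_NODE = 8  # from the 2×8 H100 example


def node_of(rank: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Index of the node hosting a global rank (assumes node-by-node numbering)."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    return rank // gpus_per_node


def link_between(rank_a: int, rank_b: int, gpus_per_node: int = GPUS_PER_NODE) -> str:
    """Interconnect two distinct ranks share: NVLink inside a node, EFA across nodes."""
    if rank_a == rank_b:
        raise ValueError("ranks must differ")
    same_node = node_of(rank_a, gpus_per_node) == node_of(rank_b, gpus_per_node)
    return "NVLink" if same_node else "EFA"


if __name__ == "__main__":
    print(link_between(0, 7))  # both ranks sit on node 0
    print(link_between(7, 8))  # rank 7 is on node 0, rank 8 on node 1
```

Any collective that spans both nodes has at least one hop over EFA, which is why the cross-node figures matter for a 16-GPU job.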

📊 NCCL AllReduce Bandwidth

Single-node NVLink (baseline): ~34 GB/s

Cross-node EFA (best, 2×8 GPUs): ~21 GB/s avg, 33.6 GB/s peak

Cross-node EFA (conservative): ~9.5 GB/s avg

Future with GPUDirect RDMA (estimated): 300+ GB/s
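
The figures above can be turned into rough timing estimates. The sketch below converts the 3200 Gbps EFA line rate to GB/s and estimates how long one all-reduce of a given payload would take at each listed bandwidth. It assumes the reported GB/s values are the effective throughput for the payload (the source does not say whether they are NCCL algorithm or bus bandwidth) and uses decimal gigabytes. The 1 GB payload is a hypothetical example, not a measurement.

```python
def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert a link rate in gigabits per second to gigabytes per second."""
    return gbps / 8.0


def allreduce_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Rough all-reduce time: payload divided by effective bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return payload_gb / bandwidth_gb_per_s


# Bandwidths in GB/s, copied from the list above.
BANDWIDTHS = {
    "single-node NVLink (baseline)": 34.0,
    "cross-node EFA (best, avg)": 21.0,
    "cross-node EFA (conservative, avg)": 9.5,
}

PAYLOAD_GB = 1.0  # hypothetical gradient payload, for illustration only

if __name__ == "__main__":
    print(f"EFA line rate: {gbps_to_gigabytes_per_s(3200):.0f} GB/s")
    for name, bandwidth in BANDWIDTHS.items():
        millis = allreduce_seconds(PAYLOAD_GB, bandwidth) * 1000
        print(f"{name}: {millis:.1f} ms per {PAYLOAD_GB:g} GB all-reduce")
```

The 3200 Gbps link works out to 400 GB/s of line rate, so the measured cross-node numbers are well below what the network hardware can carry.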

💻 One Command — Automatic NCCL + SSH + Env Setup

          $ gpu-dev reserve --gpu-type h100 --gpus 16 --distributed --hours 8

          

          📋 Reservation mn-abc123 — 2 nodes × 8 GPUs

          🔗 EFA networking configured · peer SSH ready

          ✅ MASTER_ADDR, WORLD_SIZE, RANK, NCCL_* pre-configured

          

          # Just run torchrun — everything is set up

          dev@node-0 $ torchrun --nproc_per_node 8 --nnodes 2 --node_rank 0 train.py
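
For readers wondering what `train.py` needs to do with this setup, here is a minimal sketch of its startup code. It assumes each worker process sees `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` in its environment, as torchrun normally provides. `DistEnv`, `read_dist_env`, and `init_distributed` are hypothetical helper names; the contents of the actual `train.py` are not shown in the source.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistEnv(NamedTuple):
    rank: int         # global rank, 0..world_size-1
    local_rank: int   # GPU index on this node, 0..7 in the 2×8 example
    world_size: int   # total processes, 16 in the 2×8 example
    master_addr: str
    master_port: int


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Parse the distributed settings from the environment, failing loudly if any is absent."""
    env = os.environ if env is None else env
    required = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    missing = [key for key in required if key not in env]
    if missing:
        raise RuntimeError("missing distributed env vars: " + ", ".join(missing))
    return DistEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )


def init_distributed() -> DistEnv:
    """Call once at the top of train.py; needs PyTorch with NCCL and visible GPUs."""
    import torch
    import torch.distributed as dist

    cfg = read_dist_env()
    torch.cuda.set_device(cfg.local_rank)    # bind this process to its own GPU
    dist.init_process_group(backend="nccl")  # default env:// init reads the variables above
    return cfg
```

Checking the variables up front gives a clear error when the script is started without a launcher, instead of a hang inside NCCL initialization.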
        
