Slurm Job: Cluster Sampler & Diagnostics (One-Click)
This job collects GPU/CPU, memory, NUMA, PCIe/NVLink, NIC/IB, and optional Nsight/NCCL/iperf3 telemetry across all allocated nodes while your workload runs, then bundles everything into a single .tgz.
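Usage: Save the script as profile_env.slurm, then submit it with sbatch, passing the workload command and the optional switches through --export: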
sbatch --export=ALL,WORKLOAD="torchrun --nproc_per_node=8 train.py --cfg config.yaml",ENABLE_NSYS=1,RUN_NCCL_TESTS=1,DURATION=1800 profile_env.slurm
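The variables passed with --export reach the batch script as ordinary environment variables. As a rough illustration only, the sketch below shows one way a script like profile_env.slurm could read them. The function name describe_config, the default values, and the reading of DURATION as seconds are all assumptions made for this example, not documented behavior of the script.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the real profile_env.slurm may handle these differently.

# Print the effective settings, applying assumed defaults for unset variables.
describe_config() {
  local workload="${WORKLOAD:-}"
  local nsys="${ENABLE_NSYS:-0}"        # assumed default: Nsight profiling off
  local nccl="${RUN_NCCL_TESTS:-0}"     # assumed default: NCCL tests off
  local duration="${DURATION:-600}"     # assumed default; unit assumed to be seconds

  # Refuse to continue without a workload command to run.
  if [ -z "$workload" ]; then
    echo "error: WORKLOAD is not set" >&2
    return 1
  fi

  printf 'workload=%s\nnsys=%s\nnccl_tests=%s\nduration=%s\n' \
    "$workload" "$nsys" "$nccl" "$duration"
}
```

Calling describe_config near the top of the batch script would record the effective settings in the job's output file, which makes a run easier to interpret later. Note that sbatch splits the --export list on commas, so a WORKLOAD command that itself contains commas would need different handling.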
