Day: October 9, 2025
Slurm Job: Cluster Sampler & Diagnostics (One-Click)
This job collects GPU/CPU, memory, NUMA, PCIe/NVLink, NIC/IB, and optional Nsight/NCCL/iperf3 telemetry across all allocated nodes while your workload runs, then bundles everything into a single .tgz.
sbatch --export=ALL,WORKLOAD="torchrun --nproc_per_node=8 train.py --cfg config.yaml",ENABLE_NSYS=1,RUN_NCCL_TESTS=1,DURATION=1800 profile_env.slurm
Prefer a direct file? You can also grab the ready-made script: Download profile_env.slurm
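If you just want to see the shape of the wrapper before downloading it, here is a minimal sketch, not the actual profile_env.slurm: it assumes the WORKLOAD and ENABLE_NSYS variables from the sbatch line above and keeps only a per-node GPU sampler; the real script also handles DURATION, NCCL tests, iperf3, and the rest of the telemetry.

#!/usr/bin/env bash
#SBATCH --job-name=profile_env
#SBATCH --output=profile_env_%j.log
set -euo pipefail

OUT="prof_${SLURM_JOB_ID}"; mkdir -p "$OUT"

# One lightweight GPU sampler per allocated node, running in the background
srun --ntasks-per-node=1 --overlap bash -c \
  "nvidia-smi dmon -s pucvmet -d 1 > $OUT/gpu_\$(hostname).log" &
SAMPLER=$!

# Run the workload, optionally under Nsight Systems
if [[ "${ENABLE_NSYS:-0}" == "1" ]]; then
  srun nsys profile -t cuda,nvtx,osrt -o "$OUT/trace_%q{SLURM_PROCID}" ${WORKLOAD}
else
  srun ${WORKLOAD}
fi

kill "$SAMPLER" 2>/dev/null || true
tar czf "prof_${SLURM_JOB_ID}.tgz" "$OUT"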
Profiling Playbook: Detect GPU/CPU, Memory Bandwidth, and Network Bottlenecks
A practical, repeatable workflow for NVIDIA-GPU Linux clusters (Slurm/K8s or bare-metal) to pinpoint whether your bottleneck is GPU, CPU, memory bandwidth, or network.
0) Prep: Make the Test Reproducible
- Choose a workload: (a) your real training/inference job, plus (b) a couple of microbenchmarks.
- Pin placement/affinity: match production (same container, CUDA/cuDNN, drivers, env vars, GPU/CPU affinity).
- Record node info: driver, CUDA, GPU model, CPU model, NUMA, NIC, topology.
nvidia-smi; nvidia-smi topo -m
lscpu; numactl --hardware
1) GPU Profiling (Utilization, Kernels, Memory, Interconnect)
Quick Live View (low overhead)
# 1 s sampling: power/temp (p), utilization (u), clocks (c), violations (v), memory (m), ECC/replay errors (e), PCIe throughput (t)
nvidia-smi dmon -s pucvmet
# More fields, CSV:
nvidia-smi --query-gpu=index,name,utilization.gpu,utilization.memory,clocks.sm,clocks.mem,power.draw,temperature.gpu,pcie.link.gen.current,pcie.link.width.current,clocks_throttle_reasons.active --format=csv -l 1
- utilization.gpu ~ 0–40% while the job is “busy” → likely CPU or input (I/O) bound.
- High memory util + low SM util → global memory bandwidth bound.
- Power below expected / throttling active → power/thermal cap or app clocks.
- PCIe gen/width lower than expected → host-device transfer bottleneck.
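To turn the live view into a single number before diving deeper, a rough sketch like the following averages utilization.gpu per GPU over a one-minute window (thresholds and the window length are illustrative):

# Log 60 s of per-GPU utilization, then print the average SM util per GPU
timeout 60 nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits -l 1 > /tmp/util.csv
awk -F', ' '{sum[$1]+=$2; n[$1]++} END {for (g in sum) printf "GPU %s: %.1f%% avg util\n", g, sum[g]/n[g]}' /tmp/util.csv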
Deep Timeline (Nsight Systems → find where time is spent)
nsys profile -t cuda,osrt,nvtx,mpi --sample=process-tree -o /tmp/trace \
--export=sqlite python train.py
# Open /tmp/trace.qdrep (or trace.nsys-rep on newer versions) in the Nsight Systems GUI, or analyze the sqlite export
- Long CPU gaps before kernels → dataloader/CPU stall.
- CUDA memcpy / NCCL all-reduce dominating → I/O or network bottleneck.
- Many short kernels with gaps → kernel launch overhead (try CUDA Graphs).
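The same trace can be summarized without the GUI; report names differ between nsys releases (older ones use gpukernsum/gpumemtimesum), so adjust to what your version lists:

# Top CUDA kernels and memcpy/memset time, straight from the report file
nsys stats --report cuda_gpu_kern_sum,cuda_gpu_mem_time_sum /tmp/trace.nsys-rep
# NVTX range summary (only useful if your code emits NVTX ranges)
nsys stats --report nvtx_sum /tmp/trace.nsys-rep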
Kernel Efficiency (Nsight Compute → why GPU is slow)
ncu --set full --target-processes all -o /tmp/ncu python train.py
# Then: ncu --import /tmp/ncu.ncu-rep --csv --page summary
- Low achieved SM occupancy & high dram__throughput relative to arithmetic intensity → memory-bound kernels.
- High barrier/serialization → reformulate kernels or change backend.
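If the full set is too slow on a long run, a lighter pass over just the speed-of-light and occupancy metrics usually settles memory-bound vs compute-bound; the metric names below are the usual Nsight Compute ones but can vary across versions:

# SM vs DRAM throughput as % of peak, plus achieved occupancy, for the first 20 kernel launches
ncu --launch-count 20 --target-processes all \
    --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed,\
sm__warps_active.avg.pct_of_peak_sustained_active \
    -o /tmp/ncu_light python train.py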
NVLink / PCIe Health
# NVLink link status/speed (A100+/NVSwitch); add -e for error counters
nvidia-smi nvlink -s
# Topology sanity:
nvidia-smi topo -m
If inter-GPU traffic stalls or retry errors climb, expect intra-node comms bottlenecks.
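To put actual numbers on the intra-node paths, the p2pBandwidthLatencyTest sample from NVIDIA's cuda-samples repository prints a GPU-to-GPU bandwidth/latency matrix (directory layout and build system vary by release, hence the find):

# D2D bandwidth well below NVLink/PCIe expectations → suspect topology, a downed link, or disabled P2P
git clone --depth 1 https://github.com/NVIDIA/cuda-samples.git
cd "$(find cuda-samples -type d -name p2pBandwidthLatencyTest | head -1)" && make
./p2pBandwidthLatencyTest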
2) CPU & Memory-Bandwidth Profiling (Host Side)
Fast CPU View
mpstat -P ALL 1
pidstat -u -r -d 1 -p $(pgrep -n python) # CPU, RSS, I/O per PID
- High CPU% and a long run queue while the GPU sits idle → CPU compute bound (augmentations, tokenization).
- Low CPU% with high I/O wait while the GPU sits idle → storage or network input bottleneck.
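A quick way to separate those two cases is the run-queue (r) versus I/O-wait (wa) columns:

# High 'r' (runnable threads) with idle GPUs → CPU bound; high 'wa' → storage/input bound
vmstat 1 10
# If 'wa' is high, check per-device latency and utilization (sysstat package)
iostat -x 1 5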
NUMA Locality (critical for feeders/data loaders)
numactl -s
numastat -p $(pgrep -n python)   # remote vs local memory hits
Many remote hits → pin processes to the closest NUMA node; bind NIC/GPU affinity.
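nvidia-smi topo -m shows the CPU/NUMA affinity for each GPU; a minimal binding sketch (node 0 is an example, use the node reported for your GPU/NIC) is:

# CPU Affinity / NUMA Affinity columns tell you which node each GPU sits on
nvidia-smi topo -m
# Run the feeder (or the whole job) on that node's cores and memory
numactl --cpunodebind=0 --membind=0 python train.py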
Hardware Counters (perf) & Memory Bandwidth
# Whole process counters
perf stat -d -p $(pgrep -n python) -- sleep 30
# Hotspots (then open interactive report)
perf record -F 99 -g -p $(pgrep -n python) -- sleep 30
perf report
Low IPC + many L3/memory stalls → memory-bandwidth bound on the CPU. Validate with STREAM / Intel PCM:
# STREAM (approximate host RAM BW)
stream
# Intel PCM memory (Intel CPUs)
pcm-memory 1
3) Network Throughput/Latency (Intra & Inter-node)
Raw NIC Performance
# TCP test (adjust -P for parallel flows)
iperf3 -s # on server
iperf3 -c <server> -P 8 -t 30
# For UDP or specific MTU/jumbo frames: use -u and set the MTU via ip link/ethtool
Compare results to the NIC line rate (e.g., 100/200/400 GbE).
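To avoid eyeballing the text output, iperf3's JSON mode plus jq yields the achieved Gbit/s directly (the jq path assumes iperf3's standard TCP JSON layout):

# Achieved TCP throughput in Gbit/s; compare against the NIC's line rate
iperf3 -c <server> -P 8 -t 30 -J | jq '.end.sum_received.bits_per_second / 1e9'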
RDMA / InfiniBand (if applicable)
ibstat; ibv_devinfo
ib_write_bw -d mlx5_0 -F -q 4 -l 512 -s 8388608 -D 30
ib_send_bw -d mlx5_0 -F -q 4 -l 512 -s 8388608 -D 30
If RDMA BW/latency is poor, check PFC/ECN, RoCE config, and MTU 9000 end-to-end.
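Bandwidth numbers can look fine while latency is not; the matching perftest latency tool plus an MTU check per interface closes that gap (the HCA name is an example):

# Small-message RDMA write latency: start 'ib_write_lat -d mlx5_0 -F' on the server, then on the client:
ib_write_lat -d mlx5_0 -F -s 64 -n 10000 <server>
# Confirm jumbo frames are actually configured on this hop
ip link show <iface> | grep -o 'mtu [0-9]*'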
Collective (NCCL) Reality Check
# From nccl-tests (build once)
./build/all_reduce_perf -b 8M -e 1G -f 2 -g 8 # intra-node
# Multi-node (via mpirun or torchrun); see the srun sketch below
Throughput far below expectation → network path/topology, or NCCL env (e.g., NCCL_IB_*, NCCL_NET_GDR_LEVEL, CollNet/NVLS).
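A hedged sketch of the multi-node run under Slurm (one task per GPU; the HCA pattern and GDR level are placeholders to adapt to your fabric):

# NCCL_DEBUG=INFO prints the transport NCCL actually picked (IB/RoCE vs plain TCP) and the ring/tree layout
NCCL_DEBUG=INFO NCCL_IB_HCA=mlx5 NCCL_NET_GDR_LEVEL=PHB \
  srun -N 2 --ntasks-per-node=8 --gpus-per-node=8 \
  ./build/all_reduce_perf -b 8M -e 1G -f 2 -g 1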
NIC Counters / Driver
ethtool -S <iface> | egrep "err|drop|disc|pause"
ethtool -k <iface>   # offloads; ensure GRO/LRO settings suit your stack
Growing errors/pause frames → congestion, bad optics, or flow-control tuning.
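Absolute counter values matter less than how fast they grow while the job runs; a before/after snapshot makes that visible:

# Snapshot error/drop/pause counters, wait a minute under load, and show what changed
ethtool -S <iface> | egrep "err|drop|disc|pause" | sort > /tmp/nic_before
sleep 60
ethtool -S <iface> | egrep "err|drop|disc|pause" | sort > /tmp/nic_after
diff /tmp/nic_before /tmp/nic_after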
4) Tie It Together with a Roofline View
Compute intensity (FLOPs/byte) vs achieved bandwidth quickly classifies memory-bound vs compute-bound. Use Nsight Compute’s roofline page for kernels; for end-to-end, annotate steps with NVTX and view in Nsight Systems.
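For the kernel-level roofline, recent Nsight Compute releases expose a dedicated roofline set; the set name is version-dependent, so fall back to --set full if it is not available on yours:

# Collect roofline data for the first 20 kernel launches, then open the Roofline page in the ncu GUI
ncu --set roofline --launch-count 20 -o /tmp/roofline python train.py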
5) Microbenchmarks to Isolate Layers
- GPU math: HPL/HPL-AI, cuBLAS GEMM runner, nvidia/cuda-samples (matrixMulCUBLAS).
- Host RAM BW: STREAM.
- Disk I/O: fio (sequential vs random, queue depth); see the sketch below.
- Network: iperf3, ib_*_bw, NCCL tests.
If the microbenchmarks are fine but the real job isn't, the issue is in the software pipeline (dataloader, preprocessing, small batches, the Python GIL, etc.).
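For the disk I/O item in the list above, a representative pair of fio runs could look like this (sizes and the directory are placeholders; make the working set larger than the page cache):

# Sequential 1 MiB reads: approximates streaming a dataset
fio --name=seqread --rw=read --bs=1M --size=10G --numjobs=4 --iodepth=16 \
    --ioengine=libaio --direct=1 --group_reporting --directory=/path/to/dataset_fs
# Random 4 KiB reads: approximates shuffled access / many small files
fio --name=randread --rw=randread --bs=4k --size=2G --numjobs=8 --iodepth=32 \
    --ioengine=libaio --direct=1 --group_reporting --directory=/path/to/dataset_fs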
6) Common Bottlenecks → Fixes
| Symptom | Likely Bottleneck | Quick Fixes |
|---|---|---|
| GPU util low, CPU busy | CPU pipeline | Increase workers/prefetch, move aug to GPU (DALI), compile ops, pin threads/NUMA. |
| High GPU mem util, SM low | GPU mem-bound | Fuse kernels, better tensor layouts, mixed precision (bf16/fp16), larger batch if headroom. |
| NCCL all-reduce dominates | Network | Enable RDMA, tune NCCL env, jumbo MTU 9000, keep same switch tier, test CollNet/NVLS. |
| memcpy HtoD heavy | PCIe/host I/O | Page-locked buffers, async prefetch, increase batch queue, ensure max PCIe Gen/width. |
| Frequent GPU throttling | Power/Thermal | Raise power limit (if safe), fix cooling, set application clocks, check throttling reasons. |
| Remote NUMA hits high | NUMA | Bind processes to local NUMA of GPU/NIC, interleave wisely. |
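As a worked example for the throttling row (values are illustrative; keep them within what your board and data sheet allow):

# Which throttle reasons are active, and what are the current vs. max power limits?
nvidia-smi -q -d PERFORMANCE,POWER | egrep -i "throttle|power limit"
# Raise the power limit and pin application clocks (root required)
sudo nvidia-smi -pl 400
sudo nvidia-smi -ac 1593,1410   # <mem_clock,sm_clock>; valid pairs: nvidia-smi -q -d SUPPORTED_CLOCKS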
7) Optional: One-Node Sampler Script
Paste into profile.sh and run bash profile.sh python train.py.
#!/usr/bin/env bash
set -euo pipefail
APP=("$@")   # e.g., python train.py
echo "== System =="
nvidia-smi --query-gpu=name,uuid,driver_version,pstate,pcie.link.gen.current,pcie.link.width.current --format=csv
lscpu | egrep 'Model name|Socket|NUMA|Thread|MHz'
echo
echo "== Start background samplers =="
(nvidia-smi dmon -s pucvmet -d 1 > /tmp/gpu_dmon.log) &
GPU_DMON_PID=$!
(pidstat -u -r -d 1 > /tmp/pidstat.log) &
PIDSTAT_PID=$!
echo "== Run workload =="
"${APP[@]}" || true
echo "== Cleanup =="
kill $GPU_DMON_PID $PIDSTAT_PID 2>/dev/null || true
echo "== Summaries =="
head /tmp/gpu_dmon.log
tail -n 20 /tmp/gpu_dmon.log
tail -n 20 /tmp/pidstat.log
8) HPE-Specific Checks (If Relevant)
- HPE iLO/OneView: check thermal/power capping, fan curves, PSU headroom.
- HPE Performance Cluster Manager / Cray: use built-in telemetry and fabric diagnostics.
- BIOS: Performance power profile, NUMA exposed, deterministic turbo, PCIe Gen4/Gen5, Above 4G decoding on, SR-IOV/ATS if virtualized.
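Several of these BIOS-level settings can be sanity-checked from the OS without a reboot (the PCI address is a placeholder; take it from lspci or nvidia-smi -q):

# Negotiated PCIe speed/width for the GPU vs. its capability
sudo lspci -vv -s 3b:00.0 | egrep "LnkCap:|LnkSta:"
# CPU frequency driver and governor ('performance' is typical for compute nodes)
cpupower frequency-info | egrep -i "driver|governor"
# NUMA layout as the OS sees it
numactl --hardware | head -n 3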
