Date: July 15, 2025
Deploying SLURM with Slinky: Bridging HPC and Kubernetes for Container Workloads
High-Performance Computing (HPC) environments are evolving rapidly, and the need to integrate traditional HPC job schedulers with modern containerized infrastructure has never been greater. Enter Slinky – SchedMD’s official project that seamlessly integrates SLURM with Kubernetes, enabling you to run containerized workloads through SLURM’s powerful scheduling capabilities.
In this comprehensive guide, we’ll walk through deploying SLURM using Slinky with Docker container support, bringing together the best of both HPC and cloud-native worlds.
What is Slinky?
Slinky is a toolbox of components developed by SchedMD (the creators of SLURM) to integrate SLURM with Kubernetes. Unlike traditional approaches that force users to change how they interact with SLURM, Slinky preserves the familiar SLURM user experience while adding powerful container orchestration capabilities.
Key Components:
- Slurm Operator – Manages SLURM clusters as Kubernetes resources
- Container Support – Native OCI container execution through SLURM
- Auto-scaling – Dynamic resource allocation based on workload demand
- Slurm Bridge – Converged workload scheduling and prioritization
Prerequisites and Environment Setup
Before we begin, ensure you have a working Kubernetes cluster that meets the following requirements (a quick verification sketch follows the list):
- Kubernetes 1.24+ cluster with admin access
- Helm 3.x installed
- kubectl configured and connected to your cluster
- Sufficient cluster resources (minimum 4 CPU cores, 8GB RAM)
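You can sanity-check these prerequisites from your shell before moving on (output will vary by environment):

# Confirm tool versions and cluster connectivity
kubectl version        # client and server should both report 1.24+
helm version           # should report a 3.x release
kubectl get nodes      # verifies kubectl can reach the cluster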
Step 1: Install Required Dependencies
Slinky requires several prerequisite components. Let’s install them using Helm:
# Add required Helm repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Install cert-manager for TLS certificate management
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true

# Install Prometheus stack for monitoring
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus --create-namespace --set installCRDs=true
Wait for all pods to be running before proceeding:
# Verify installations
kubectl get pods -n cert-manager
kubectl get pods -n prometheus
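Rather than polling by hand, you can also block until everything reports Ready; a minimal sketch using kubectl wait (the timeout value is illustrative):

# Block until all pods in each namespace are Ready, or fail after 5 minutes
kubectl wait --for=condition=Ready pods --all -n cert-manager --timeout=300s
kubectl wait --for=condition=Ready pods --all -n prometheus --timeout=300s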
Step 2: Deploy the Slinky SLURM Operator
Now we’ll install the core Slinky operator that manages SLURM clusters within Kubernetes:
# Download the default configuration
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm-operator/values.yaml \
  -o values-operator.yaml

# Install the Slurm Operator
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --values=values-operator.yaml --version=0.2.1 \
  --namespace=slinky --create-namespace
Verify the operator is running:
kubectl get pods -n slinky

# Expected output: slurm-operator pod in Running status
Step 3: Configure Container Support
Before deploying the SLURM cluster, let’s configure it for container support. Download and modify the SLURM configuration:
# Download SLURM cluster configuration
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm/values.yaml \
  -o values-slurm.yaml
Edit values-slurm.yaml to enable container support:
# Add container configuration to values-slurm.yaml
controller:
  config:
    slurm.conf: |
      # Basic cluster configuration
      ClusterName=slinky-cluster
      ControlMachine=slurm-controller-0

      # Enable container support
      ProctrackType=proctrack/cgroup
      TaskPlugin=task/cgroup,task/affinity
      PluginDir=/usr/lib64/slurm

      # Authentication
      AuthType=auth/munge

      # Node configuration
      NodeName=slurm-compute-debug-[0-9] CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
      PartitionName=debug Nodes=slurm-compute-debug-[0-9] Default=YES MaxTime=INFINITE State=UP

      # Accounting
      AccountingStorageType=accounting_storage/slurmdbd
      AccountingStorageHost=slurm-accounting-0

compute:
  config:
    oci.conf: |
      # OCI container runtime configuration
      RunTimeQuery="runc --version"
      RunTimeCreate="runc create %n.%u %b"
      RunTimeStart="runc start %n.%u"
      RunTimeKill="runc kill --all %n.%u SIGTERM"
      RunTimeDelete="runc delete --force %n.%u"

      # Security and patterns
      OCIPattern="^[a-zA-Z0-9][a-zA-Z0-9_.-]*$"
      CreateEnvFile="/tmp/slurm-oci-create-env-%j.%u.%t.tmp"
      RunTimeEnvExclude="HOME,PATH,LD_LIBRARY_PATH"
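Before deploying, it is worth confirming that your edits render the way you expect. A quick hedged check, assuming the chart templates slurm.conf and oci.conf into ConfigMaps (inspect the rendered output to confirm):

# Render the chart locally and inspect the generated configuration
helm template slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.2.1 | grep -A 10 "oci.conf"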
Step 4: Deploy the SLURM Cluster
Now deploy the SLURM cluster with container support enabled:
# Deploy SLURM cluster
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.2.1 \
  --namespace=slurm --create-namespace
Monitor the deployment progress:
# Watch pods come online
kubectl get pods -n slurm -w

# Expected pods:
# slurm-accounting-0      1/1   Running
# slurm-compute-debug-0   1/1   Running
# slurm-controller-0      2/2   Running
# slurm-exporter-xxx      1/1   Running
# slurm-login-xxx         1/1   Running
# slurm-mariadb-0         1/1   Running
# slurm-restapi-xxx       1/1   Running
Step 5: Access and Test the SLURM Cluster
Once all pods are running, connect to the SLURM login node:
# Get login node IP address
SLURM_LOGIN_IP="$(kubectl get services -n slurm -l app.kubernetes.io/instance=slurm,app.kubernetes.io/name=login -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ip}")"
# SSH to login node (default port 2222)
ssh -p 2222 root@${SLURM_LOGIN_IP}
If you don’t have LoadBalancer support, use port-forwarding:
# Port forward to the login service
kubectl port-forward -n slurm service/slurm-login 2222:2222

# Connect via localhost
ssh -p 2222 root@localhost
Step 6: Running Container Jobs
Now for the exciting part – running containerized workloads through SLURM!
Basic Container Job
Create a simple container job script:
# Create a container job script
cat > container_test.sh << EOF
#!/bin/bash
#SBATCH --job-name=container-hello
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --container=docker://alpine:latest

echo "Hello from containerized SLURM job!"
echo "Running on node: \$(hostname)"
echo "Job ID: \$SLURM_JOB_ID"
echo "Container OS: \$(cat /etc/os-release | grep PRETTY_NAME)"
EOF

# Submit the job
sbatch container_test.sh

# Check job status
squeue
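By default, sbatch writes the job's stdout to slurm-<jobid>.out in the submission directory, so once the job completes you can inspect the result:

# Inspect the output file (substitute the job ID printed by sbatch)
cat slurm-<jobid>.out

# Review accounting details via slurmdbd
sacct -j <jobid> --format=JobID,JobName,State,Elapsed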
Interactive Container Sessions
Run containers interactively using srun:
# Interactive Ubuntu container
srun --container=docker://ubuntu:20.04 /bin/bash
# Quick command in Alpine container
srun --container=docker://alpine:latest /bin/sh -c "echo 'Container execution successful'; uname -a"
# Python data science container
srun --container=docker://python:3.9 python -c "import sys; print(f'Python {sys.version} running in container')"
GPU Container Jobs
If your cluster has GPU nodes, you can run GPU-accelerated containers:
# GPU container job
cat > gpu_container.sh << EOF
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --gres=gpu:1
#SBATCH --container=docker://nvidia/cuda:11.0-runtime-ubuntu20.04

# nvidia-smi verifies GPU visibility; note that -runtime images do not
# ship nvcc, so use a -devel image if you need the CUDA compiler
nvidia-smi
EOF

sbatch gpu_container.sh
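Note that GPU scheduling also requires generic resources (GRES) to be defined on the SLURM side; the node names, device paths, and counts below are illustrative assumptions for a hypothetical GPU node set:

# Illustrative slurm.conf additions for GPU scheduling (names/counts assumed)
GresTypes=gpu
NodeName=slurm-compute-gpu-[0-9] Gres=gpu:1 CPUs=8 State=UNKNOWN

# Matching gres.conf entry mapping the GRES to a device (path assumed)
Name=gpu File=/dev/nvidia0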
MPI Container Jobs
Run parallel MPI applications in containers:
# MPI container job
cat > mpi_container.sh << EOF
#!/bin/bash
#SBATCH --job-name=mpi-test
#SBATCH --ntasks=4
#SBATCH --container=docker://mpirun/openmpi:latest

mpirun -np \$SLURM_NTASKS hostname
EOF

sbatch mpi_container.sh
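Depending on how the image's MPI was built, launching through SLURM's own process manager can integrate more cleanly than calling mpirun inside the allocation. A hedged alternative, assuming both SLURM and the container's Open MPI were built with PMIx support:

# Launch tasks via SLURM with PMIx instead of mpirun (support is build-dependent)
srun --mpi=pmix --container=docker://mpirun/openmpi:latest hostname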
Step 7: Monitoring and Auto-scaling
Monitor Cluster Health
Check SLURM cluster status from the login node:
# Check node status
sinfo

# Check running jobs
squeue

# Check cluster configuration
scontrol show config | grep -i container
Kubernetes Monitoring
Monitor from the Kubernetes side:
# Check pod resource usage
kubectl top pods -n slurm

# View SLURM operator logs
kubectl logs -n slinky deployment/slurm-operator

# Check custom resources
kubectl get clusters.slinky.slurm.net -n slurm
kubectl get nodesets.slinky.slurm.net -n slurm
Configure Auto-scaling
Enable auto-scaling by updating your values file:
# Add to values-slurm.yaml
compute:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

# Update the deployment
helm upgrade slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.2.1 \
  --namespace=slurm
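If the chart implements autoscaling through a standard HorizontalPodAutoscaler (an assumption here; consult the chart documentation for the exact mechanism), you can watch scaling decisions from the Kubernetes side:

# Observe autoscaling activity, assuming an HPA backs the compute node set
kubectl get hpa -n slurm -w
kubectl describe hpa -n slurm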
Advanced Configuration Tips
Custom Container Runtimes
Configure alternative container runtimes like Podman:
# Alternative oci.conf for Podman
compute:
  config:
    oci.conf: |
      # Podman runtime configuration
      RunTimeQuery="podman --version"
      RunTimeRun="podman run --rm --cgroups=disabled --name=%n.%u %m %c"

      # Security settings
      OCIPattern="^[a-zA-Z0-9][a-zA-Z0-9_.-]*$"
      CreateEnvFile="/tmp/slurm-oci-create-env-%j.%u.%t.tmp"
Persistent Storage for Containers
Configure persistent volumes for containerized jobs:
# Add persistent volume support
compute:
  persistence:
    enabled: true
    storageClass: "fast-ssd"
    size: "100Gi"
    mountPath: "/shared"
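With persistence enabled as above, jobs can write results that survive pod restarts, assuming the volume is visible at /shared inside job containers (verify this against your bind-mount configuration). For example:

# Example job appending results to the persistent /shared volume
cat > shared_write.sh << EOF
#!/bin/bash
#SBATCH --job-name=shared-write
#SBATCH --container=docker://alpine:latest

echo "Results from job \$SLURM_JOB_ID" >> /shared/results.txt
EOF

sbatch shared_write.sh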
Troubleshooting Common Issues
Container Runtime Not Found
If you encounter container runtime errors:
# Check runtime availability on compute nodes
kubectl exec -n slurm slurm-compute-debug-0 -- which runc
kubectl exec -n slurm slurm-compute-debug-0 -- runc --version

# Verify oci.conf is properly mounted
kubectl exec -n slurm slurm-compute-debug-0 -- cat /etc/slurm/oci.conf
Job Submission Failures
Debug job submission issues:
# Check SLURM controller logs
kubectl logs -n slurm slurm-controller-0 -c slurmctld

# Verify container image availability
srun --container=docker://alpine:latest /bin/echo "Container test"

# Check job details
scontrol show job
Conclusion
Slinky represents a significant step forward in bridging the gap between traditional HPC and modern cloud-native infrastructure. By deploying SLURM with Slinky, you get:
- Unified Infrastructure - Run both SLURM and Kubernetes workloads on the same cluster
- Container Support - Native OCI container execution through familiar SLURM commands
- Auto-scaling - Dynamic resource allocation based on workload demand
- Cloud Native - Standard Kubernetes deployment and management patterns
- Preserved Workflow - Keep existing SLURM scripts and user experience
This powerful combination enables organizations to modernize their HPC infrastructure while maintaining the robust scheduling and resource management capabilities that SLURM is known for. Whether you're running AI/ML training workloads, scientific simulations, or data processing pipelines, Slinky provides the flexibility to containerize your applications without sacrificing the control and efficiency of SLURM.
Ready to get started? The Slinky project is open-source and available on GitHub. Visit the SlinkyProject GitHub organization for the latest documentation and releases.
