July 15, 2025

Deploying SLURM with Slinky: Bridging HPC and Kubernetes for Container Workloads

High-Performance Computing (HPC) environments are evolving rapidly, and the need to integrate traditional HPC job schedulers with modern containerized infrastructure has never been greater. Enter Slinky – SchedMD’s official project that seamlessly integrates SLURM with Kubernetes, enabling you to run containerized workloads through SLURM’s powerful scheduling capabilities.

In this comprehensive guide, we’ll walk through deploying SLURM using Slinky with Docker container support, bringing together the best of both HPC and cloud-native worlds.

What is Slinky?

Slinky is a toolbox of components developed by SchedMD (the creators of SLURM) to integrate SLURM with Kubernetes. Unlike traditional approaches that force users to change how they interact with SLURM, Slinky preserves the familiar SLURM user experience while adding powerful container orchestration capabilities.

Key Components:

  • Slurm Operator – Manages SLURM clusters as Kubernetes resources
  • Container Support – Native OCI container execution through SLURM
  • Auto-scaling – Dynamic resource allocation based on workload demand
  • Slurm Bridge – Converged workload scheduling and prioritization

Why Slinky Matters: Slinky enables simultaneous management of HPC workloads using SLURM and containerized applications via Kubernetes on the same infrastructure, making it ideal for organizations running AI/ML training, scientific simulations, and cloud-native applications.

Prerequisites and Environment Setup

Before we begin, ensure you have a working Kubernetes cluster with the following requirements:

  • Kubernetes 1.24+ cluster with admin access
  • Helm 3.x installed
  • kubectl configured and connected to your cluster
  • Sufficient cluster resources (minimum 4 CPU cores, 8GB RAM)
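
Before moving on, it's worth a quick sanity check that the tooling and cluster access are in place (the commands below are a generic verification sketch, not part of the Slinky documentation):

# Verify tooling versions and cluster connectivity
kubectl version
helm version
kubectl get nodes -o wide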

Step 1: Install Required Dependencies

Slinky requires several prerequisite components. Let’s install them using Helm:

# Add required Helm repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Install cert-manager for TLS certificate management
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true

# Install Prometheus stack for monitoring
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus --create-namespace --set installCRDs=true

Wait for all pods to be running before proceeding:

# Verify installations
kubectl get pods -n cert-manager
kubectl get pods -n prometheus
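
If you prefer to block until everything is ready rather than re-running the commands above, kubectl wait does the job (a convenience sketch; the timeout values are arbitrary):

# Optionally block until all pods in both namespaces are Ready
kubectl wait --for=condition=Ready pods --all -n cert-manager --timeout=300s
kubectl wait --for=condition=Ready pods --all -n prometheus --timeout=600s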

Step 2: Deploy the Slinky SLURM Operator

Now we’ll install the core Slinky operator that manages SLURM clusters within Kubernetes:

# Download the default configuration
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm-operator/values.yaml \
  -o values-operator.yaml

# Install the Slurm Operator
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --values=values-operator.yaml --version=0.2.1 \
  --namespace=slinky --create-namespace

Verify the operator is running:

kubectl get pods -n slinky
# Expected output: slurm-operator pod in Running status
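
You can also confirm that the operator registered its custom resource definitions and came up cleanly (the CRD group below matches the resources queried later in this guide, but double-check it against your installed operator version):

# Confirm the Slinky CRDs are registered and the operator log is clean
kubectl get crds | grep slinky.slurm.net
kubectl logs -n slinky deployment/slurm-operator --tail=20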

Step 3: Configure Container Support

Before deploying the SLURM cluster, let’s configure it for container support. Download and modify the SLURM configuration:

# Download SLURM cluster configuration
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm/values.yaml \
  -o values-slurm.yaml

Edit values-slurm.yaml to enable container support:

# Add container configuration to values-slurm.yaml
controller:
  config:
    slurm.conf: |
      # Basic cluster configuration
      ClusterName=slinky-cluster
      ControlMachine=slurm-controller-0
      
      # Enable container support
      ProctrackType=proctrack/cgroup
      TaskPlugin=task/cgroup,task/affinity
      PluginDir=/usr/lib64/slurm
      
      # Authentication
      AuthType=auth/munge
      
      # Node configuration
      NodeName=slurm-compute-debug-[0-9] CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
      PartitionName=debug Nodes=slurm-compute-debug-[0-9] Default=YES MaxTime=INFINITE State=UP
      
      # Accounting
      AccountingStorageType=accounting_storage/slurmdbd
      AccountingStorageHost=slurm-accounting-0

compute:
  config:
    oci.conf: |
      # OCI container runtime configuration
      RunTimeQuery="runc --version"
      RunTimeCreate="runc create %n.%u %b"
      RunTimeStart="runc start %n.%u"
      RunTimeKill="runc kill --all %n.%u SIGTERM"
      RunTimeDelete="runc delete --force %n.%u"
      
      # Security and patterns
      OCIPattern="^[a-zA-Z0-9][a-zA-Z0-9_.-]*$"
      CreateEnvFile="/tmp/slurm-oci-create-env-%j.%u.%t.tmp"
      RunTimeEnvExclude="HOME,PATH,LD_LIBRARY_PATH"

Step 4: Deploy the SLURM Cluster

Now deploy the SLURM cluster with container support enabled:

# Deploy SLURM cluster
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.2.1 \
  --namespace=slurm --create-namespace

Monitor the deployment progress:

# Watch pods come online
kubectl get pods -n slurm -w

# Expected pods:
# slurm-accounting-0      1/1     Running
# slurm-compute-debug-0   1/1     Running  
# slurm-controller-0      2/2     Running
# slurm-exporter-xxx      1/1     Running
# slurm-login-xxx         1/1     Running
# slurm-mariadb-0         1/1     Running
# slurm-restapi-xxx       1/1     Running
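
As with the earlier charts, you can block until the cluster settles instead of watching (a convenience sketch; the timeout is arbitrary, and the database and controller StatefulSets can take several minutes to initialize):

# Block until all SLURM pods report Ready
kubectl wait --for=condition=Ready pods --all -n slurm --timeout=600s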

Step 5: Access and Test the SLURM Cluster

Once all pods are running, connect to the SLURM login node:

# Get login node IP address
SLURM_LOGIN_IP="$(kubectl get services -n slurm -l app.kubernetes.io/instance=slurm,app.kubernetes.io/name=login -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ip}")"

# SSH to login node (default port 2222)
ssh -p 2222 root@${SLURM_LOGIN_IP}

If you don’t have LoadBalancer support, use port-forwarding:

# Port forward to login pod
kubectl port-forward -n slurm service/slurm-login 2222:2222

# Connect via localhost
ssh -p 2222 root@localhost
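
If SSH isn't an option at all, you can drop straight into the login pod with kubectl exec (this assumes the chart exposes the login node as a Deployment named slurm-login, as the pod listing above suggests):

# Fallback: open a shell directly in the login pod
kubectl exec -it -n slurm deployment/slurm-login -- /bin/bash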

Step 6: Running Container Jobs

Now for the exciting part – running containerized workloads through SLURM!

Basic Container Job

Create a simple container job script:

# Create a container job script
cat > container_test.sh << EOF
#!/bin/bash
#SBATCH --job-name=container-hello
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --container=docker://alpine:latest

echo "Hello from containerized SLURM job!"
echo "Running on node: \$(hostname)"
echo "Job ID: \$SLURM_JOB_ID"
echo "Container OS: \$(cat /etc/os-release | grep PRETTY_NAME)"
EOF

# Submit the job
sbatch container_test.sh

# Check job status
squeue
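
Once the job finishes, its stdout lands in a slurm-<jobid>.out file in the submission directory by default, and sacct shows the final state (the job ID below is a placeholder):

# Inspect the job's output and accounting record (replace <job_id> with your job ID)
cat slurm-<job_id>.out
sacct -j <job_id> --format=JobID,JobName,State,ExitCode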

Interactive Container Sessions

Run containers interactively using srun:

# Interactive Ubuntu container
srun --pty --container=docker://ubuntu:20.04 /bin/bash

# Quick command in Alpine container
srun --container=docker://alpine:latest /bin/sh -c "echo 'Container execution successful'; uname -a"

# Python data science container
srun --container=docker://python:3.9 python -c "import sys; print(f'Python {sys.version} running in container')"

GPU Container Jobs

If your cluster has GPU nodes, you can run GPU-accelerated containers:

# GPU container job
cat > gpu_container.sh << EOF
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --gres=gpu:1
#SBATCH --container=docker://nvidia/cuda:11.0-runtime-ubuntu20.04

nvidia-smi
nvcc --version
EOF

sbatch gpu_container.sh

MPI Container Jobs

Run parallel MPI applications in containers:

# MPI container job
cat > mpi_container.sh << EOF
#!/bin/bash
#SBATCH --job-name=mpi-test
#SBATCH --ntasks=4
#SBATCH --container=docker://mpirun/openmpi:latest

mpirun -np \$SLURM_NTASKS hostname
EOF

sbatch mpi_container.sh

Step 7: Monitoring and Auto-scaling

Monitor Cluster Health

Check SLURM cluster status from the login node:

# Check node status
sinfo

# Check running jobs
squeue

# Check cluster configuration
scontrol show config | grep -i container

Kubernetes Monitoring

Monitor from the Kubernetes side:

# Check pod resource usage
kubectl top pods -n slurm

# View SLURM operator logs
kubectl logs -n slinky deployment/slurm-operator

# Check custom resources
kubectl get clusters.slinky.slurm.net -n slurm
kubectl get nodesets.slinky.slurm.net -n slurm

Configure Auto-scaling

Enable auto-scaling by updating your values file:

# Add to values-slurm.yaml
compute:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

# Update the deployment
helm upgrade slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.2.1 \
  --namespace=slurm
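
After the upgrade, check whether a HorizontalPodAutoscaler was created for the compute NodeSet (whether the chart renders one from these values depends on the chart version, so treat this as a sanity check rather than a guarantee):

# Look for an autoscaler on the compute resources
kubectl get hpa -n slurm
kubectl describe hpa -n slurm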

Advanced Configuration Tips

Custom Container Runtimes

Configure alternative container runtimes like Podman:

# Alternative oci.conf for Podman
compute:
  config:
    oci.conf: |
      # Podman runtime configuration
      RunTimeQuery="podman --version"
      RunTimeRun="podman run --rm --cgroups=disabled --name=%n.%u %m %c"
      
      # Security settings
      OCIPattern="^[a-zA-Z0-9][a-zA-Z0-9_.-]*$"
      CreateEnvFile="/tmp/slurm-oci-create-env-%j.%u.%t.tmp"

Persistent Storage for Containers

Configure persistent volumes for containerized jobs:

# Add persistent volume support
compute:
  persistence:
    enabled: true
    storageClass: "fast-ssd"
    size: "100Gi"
    mountPath: "/shared"
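
With persistence enabled, jobs can write results to the shared mount. The script below is a minimal sketch that assumes /shared is mounted on the compute pods as configured above:

# Minimal sketch: write job output to the shared volume
cat > shared_write.sh << EOF
#!/bin/bash
#SBATCH --job-name=shared-write
#SBATCH --ntasks=1

echo "Job \$SLURM_JOB_ID ran on \$(hostname)" >> /shared/results.log
EOF

sbatch shared_write.sh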

Troubleshooting Common Issues

Container Runtime Not Found

If you encounter container runtime errors:

# Check runtime availability on compute nodes
kubectl exec -n slurm slurm-compute-debug-0 -- which runc
kubectl exec -n slurm slurm-compute-debug-0 -- runc --version

# Verify oci.conf is properly mounted
kubectl exec -n slurm slurm-compute-debug-0 -- cat /etc/slurm/oci.conf
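
If the runtime binary is missing, the pod spec and recent events usually explain why (these are generic Kubernetes checks rather than anything Slinky-specific):

# Inspect the compute pod and recent namespace events
kubectl describe pod -n slurm slurm-compute-debug-0
kubectl get events -n slurm --sort-by=.lastTimestamp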

Job Submission Failures

Debug job submission issues:

# Check SLURM logs
kubectl logs -n slurm slurm-controller-0 -c slurmctld

# Verify container image availability
srun --container=docker://alpine:latest /bin/echo "Container test"

# Check job details
scontrol show job <job_id>
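
For jobs that have already left the queue, the accounting database is usually more informative than scontrol (replace <job_id> with the actual ID):

# Look up the final state and exit code of a completed or failed job
sacct -j <job_id> --format=JobID,JobName,State,ExitCode,Elapsed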

Conclusion

Slinky represents a significant step forward in bridging the gap between traditional HPC and modern cloud-native infrastructure. By deploying SLURM with Slinky, you get:

  • Unified Infrastructure - Run both SLURM and Kubernetes workloads on the same cluster
  • Container Support - Native OCI container execution through familiar SLURM commands
  • Auto-scaling - Dynamic resource allocation based on workload demand
  • Cloud Native - Standard Kubernetes deployment and management patterns
  • Preserved Workflow - Keep existing SLURM scripts and user experience

This powerful combination enables organizations to modernize their HPC infrastructure while maintaining the robust scheduling and resource management capabilities that SLURM is known for. Whether you're running AI/ML training workloads, scientific simulations, or data processing pipelines, Slinky provides the flexibility to containerize your applications without sacrificing the control and efficiency of SLURM.

Next Steps: Consider exploring Slinky's advanced features like custom schedulers, resource quotas, and integration with cloud provider auto-scaling groups to further optimize your HPC container workloads.

Ready to get started? The Slinky project is open-source and available on GitHub. Visit the SlinkyProject GitHub organization for the latest documentation and releases.
