Running Spark Data Jobs on Slurm

Slurm is good at giving you machines.

Spark is good at running distributed data jobs across those machines.

They are not the same thing.

And Spark is not automatically better than raw sbatch.

For many HPC workloads, plain Slurm is still the better tool. If you have many independent jobs, use Slurm job arrays and keep life simple.

Example:

sample_001.fastq -> run tool -> output_001
sample_002.fastq -> run tool -> output_002
sample_003.fastq -> run tool -> output_003

Each task is independent. One task does not need to talk to another task.

For that, raw Slurm is perfect.

#SBATCH --array=1-100

Spark becomes useful when the work is one large data-processing job.

Example:

Read a large dataset.
Split it into partitions.
Filter rows.
Join with another dataset.
Group records.
Shuffle data between workers.
Write a final result.

You can do this with shell scripts and sbatch, but at some point you start writing your own mini distributed system.

You need to:

split data
assign work
move data between workers
retry failed tasks
merge outputs
avoid duplicate work
keep track of intermediate results

Spark already does that.

So the clean split is:

Slurm gives us the nodes.
Spark handles the distributed data work.

This guide is about that combination.

Not Spark instead of Slurm.

Not Spark for every workload.

Just Spark data jobs running inside a Slurm allocation.

What we are building

We are going to build a small Slurm cluster and run Spark inside Slurm jobs.

The setup:

master    Slurm controller + shared storage
node1     compute node
node2     compute node
node3     compute node

The flow:

Prepare hostnames.
Install Slurm.
Test multi-node Slurm jobs.
Create a shared directory.
Install Java and Spark.
Run Spark inside a Slurm allocation.

The Spark cluster is temporary.

It starts when the Slurm job starts.

It stops when the Slurm job exits.

No permanent Spark service.

No Kubernetes.

No YARN.

No extra scheduler.

Prepare hostnames

All nodes must resolve each other by name.

Example /etc/hosts on every node:

0.0.10 master
0.0.11 node1
0.0.12 node2
0.0.13 node3

Test from every node:

ping master
ping node1
ping node2
ping node3

Do not continue until this works.

Bad hostnames create boring failures later.

Install Slurm and Munge

On all nodes:

sudo apt update
sudo apt install -y slurm-wlm munge

Munge is used by Slurm for authentication.

All nodes must share the same Munge key.

Configure Munge

On the master node:

sudo create-munge-key

Copy the key to compute nodes:

sudo scp /etc/munge/munge.key node1:/etc/munge/
sudo scp /etc/munge/munge.key node2:/etc/munge/
sudo scp /etc/munge/munge.key node3:/etc/munge/

On all nodes:

sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo systemctl enable munge
sudo systemctl restart munge

Test Munge:

munge -n | unmunge

Expected:

STATUS: Success

If Munge fails, fix it before moving on.

Slurm depends on it.

Create Slurm spool directories

On the master node:

sudo mkdir -p /var/spool/slurmctld
sudo chown slurm:slurm /var/spool/slurmctld
sudo chmod 755 /var/spool/slurmctld

On each compute node:

sudo mkdir -p /var/spool/slurmd
sudo chown slurm:slurm /var/spool/slurmd
sudo chmod 755 /var/spool/slurmd

Check the Slurm user:

id slurm

Some distributions use slightly different packaging.

If the slurm user does not exist, create it or use the correct user for your package.

Create a basic Slurm config

Create /etc/slurm/slurm.conf on all nodes.

Example:

ClusterName=spark-slurm
ControlMachine=master

SlurmUser=slurm
AuthType=auth/munge

StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd

ProctrackType=proctrack/linuxproc
TaskPlugin=task/none

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log

NodeName=node[1-3] CPUs=16 RealMemory=64000 State=UNKNOWN

PartitionName=compute Nodes=node[1-3] Default=YES MaxTime=INFINITE State=UP

Adjust:

CPUs
RealMemory
node names
partition name

Get CPU count:

nproc

Get memory in MB:

free -m

Keep the first Slurm config boring.

Do not add GPUs, accounting, cgroups, or advanced policies yet.

First make basic multi-node jobs work.

Start Slurm

On the master node:

sudo systemctl enable slurmctld
sudo systemctl restart slurmctld

On each compute node:

sudo systemctl enable slurmd
sudo systemctl restart slurmd

Check cluster status from the master:

sinfo

Expected shape:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      3   idle node[1-3]

If nodes are down:

scontrol show node node1
sudo journalctl -u slurmd -n 100
sudo journalctl -u slurmctld -n 100

Fix Slurm first.

Spark will not make a broken Slurm cluster better.

Test a multi-node Slurm job

Create slurm-test.sh:

#!/bin/bash
#SBATCH --job-name=slurm-test
#SBATCH --partition=compute
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=2G
#SBATCH --time=00:05:00
#SBATCH --output=slurm-test-%j.out

set -euo pipefail

echo "Job ID: $SLURM_JOB_ID"
echo "Node list: $SLURM_JOB_NODELIST"

echo "Expanded nodes:"
scontrol show hostnames "$SLURM_JOB_NODELIST"

echo "Running hostname on all tasks:"
srun hostname

echo "CPU check:"
srun bash -c 'echo "$(hostname): $(nproc) CPUs"'

echo "Memory check:"
srun bash -c 'echo "$(hostname): $(free -m | awk "/Mem:/ {print \$2}") MB"'

Submit it:

sbatch slurm-test.sh

Check output:

cat slurm-test-<jobid>.out

You should see all three nodes.

This is the first checkpoint.

Do not move on until multi-node Slurm jobs work.

Create a shared directory

Spark workers need to read and write the same paths.

For this small cluster, we will use NFS.

The master exports:

/shared

The compute nodes mount:

/shared

Spark will read from:

/shared/data

Spark will write to:

/shared/output

This is not the fastest storage design in the world.

It is just simple and good enough for a small cluster tutorial.

Configure NFS on the master

On the master node:

sudo apt install -y nfs-kernel-server

Create directories:

sudo mkdir -p /shared/data
sudo mkdir -p /shared/output
sudo mkdir -p /shared/logs

For a simple lab setup:

sudo chown -R nobody:nogroup /shared
sudo chmod -R 777 /shared

This is loose permissioning.

For a real environment, use proper users and groups.

Edit /etc/exports:

sudo nano /etc/exports

Add this, adjusted to your subnet:

/shared 10.0.0.0/24(rw,sync,no_subtree_check)

Apply:

sudo exportfs -ra
sudo systemctl enable nfs-server
sudo systemctl restart nfs-server

Check:

sudo exportfs -v

Mount the shared directory on compute nodes

On each compute node:

sudo apt install -y nfs-common
sudo mkdir -p /shared
sudo mount master:/shared /shared

Test from each compute node:

touch /shared/test-from-$(hostname)
ls -l /shared

You should see files created by the other nodes.

Make the mount persistent.

Edit /etc/fstab on each compute node:

sudo nano /etc/fstab

Add:

master:/shared /shared nfs defaults,_netdev 0 0

Test:

sudo umount /shared
sudo mount -a
ls /shared

If this fails, fix it before moving on.

Spark workers need the same path on every node.

Test shared storage through Slurm

Create shared-test.sh:

#!/bin/bash
#SBATCH --job-name=shared-test
#SBATCH --partition=compute
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00
#SBATCH --output=shared-test-%j.out

set -euo pipefail

srun bash -c 'echo "hello from $(hostname)" > /shared/output/test-$(hostname).txt'

echo "Files written:"
ls -l /shared/output/test-*.txt

echo "File contents:"
cat /shared/output/test-*.txt

Submit:

sbatch shared-test.sh

Check:

cat shared-test-<jobid>.out

You should see one file from each node.

This is the second checkpoint.

Now the cluster has a shared place for Spark input, output, and logs.

Install Java

Spark needs Java.

On all nodes:

sudo apt install -y openjdk-17-jre-headless

Check:

java -version

Use the same Java version on all nodes.

Install Spark on all nodes

Install Spark to the same path on every node.

Example path:

/opt/spark

On all nodes:

cd /opt
sudo tar -xzf spark-3.5.1-bin-hadoop3.tgz
sudo ln -sfn spark-3.5.1-bin-hadoop3 spark

Add Spark environment variables:

sudo tee /etc/profile.d/spark.sh >/dev/null <<'EOF'
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
EOF

Reload:

source /etc/profile.d/spark.sh

Check:

spark-submit --version

Run this on every node.

All nodes should return the same Spark version.

Create a small PySpark job

Create job.py:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("spark-on-slurm-test")
    .getOrCreate()
)

sc = spark.sparkContext

print("Spark master:", sc.master)
print("Default parallelism:", sc.defaultParallelism)

input_path = "/shared/data/events"
output_path = "/shared/output/events_summary"

data = [(i, "even" if i % 2 == 0 else "odd") for i in range(1_000_000)]

df = spark.createDataFrame(data, ["id", "kind"])

df.write.mode("overwrite").parquet(input_path)

events = spark.read.parquet(input_path)

summary = (
    events
    .where(col("id") >= 0)
    .groupBy("kind")
    .count()
)

summary.show()

summary.write.mode("overwrite").parquet(output_path)

print("Wrote output to:", output_path)

spark.stop()

This job does three things:

writes Parquet to shared storage
reads Parquet from shared storage
runs a small aggregation

It is intentionally boring.

Boring tests are easier to debug.

Run Spark inside a Slurm job

Create spark-slurm.sh:

#!/bin/bash
#SBATCH --job-name=spark-slurm
#SBATCH --partition=compute
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=01:00:00
#SBATCH --output=spark-%j.out
#SBATCH --error=spark-%j.err

set -euo pipefail

export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64

export SPARK_NO_DAEMONIZE=true
export SPARK_LOG_DIR="/shared/logs/spark-$SLURM_JOB_ID"

mkdir -p "$SPARK_LOG_DIR"

nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
master_node="${nodes[0]}"
master_url="spark://${master_node}:7077"

echo "Job ID: $SLURM_JOB_ID"
echo "Allocated nodes:"
printf '%s\n' "${nodes[@]}"
echo "Spark master node: $master_node"
echo "Spark master URL: $master_url"
echo "Spark log dir: $SPARK_LOG_DIR"

cleanup() {
    echo "Cleaning up Spark"

    for node in "${nodes[@]}"; do
        srun --nodes=1 --ntasks=1 -w "$node" \
            "$SPARK_HOME/sbin/stop-worker.sh" || true
    done

    srun --nodes=1 --ntasks=1 -w "$master_node" \
        "$SPARK_HOME/sbin/stop-master.sh" || true
}

trap cleanup EXIT

echo "Starting Spark master"
srun --nodes=1 --ntasks=1 -w "$master_node" \
    "$SPARK_HOME/sbin/start-master.sh" &

sleep 10

echo "Starting Spark workers"
for node in "${nodes[@]}"; do
    srun --nodes=1 --ntasks=1 -w "$node" \
        "$SPARK_HOME/sbin/start-worker.sh" "$master_url" &
done

sleep 15

echo "Submitting Spark job"
spark-submit \
    --master "$master_url" \
    --deploy-mode client \
    --executor-cores 8 \
    --executor-memory 24G \
    ./job.py

Submit:

sbatch spark-slurm.sh

Check output:

cat spark-<jobid>.out
cat spark-<jobid>.err

Check Spark logs:

ls /shared/logs/spark-<jobid>

Check output data:

ls /shared/output/events_summary

You should see Parquet output.

Why the Spark script works

Slurm gives the job a node list:

$SLURM_JOB_NODELIST

This expands it:

nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))

The first node becomes the Spark master:

master_node="${nodes[0]}"

The Spark master URL becomes something like:

spark://node1:7077

Then the script starts:

one Spark master
one Spark worker per Slurm node

Finally:

spark-submit --master "$master_url" ./job.py

The Spark job runs only inside the nodes Slurm allocated.

The data paths are shared:

/shared/data
/shared/output
/shared/logs

That is the key.

Without a shared path, workers may not see the same files.

Resource mapping

Keep Slurm and Spark aligned.

This Slurm request:

#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

means:

nodes
Spark worker per node
CPU cores per worker
GB memory per node

So this Spark config makes sense:

--executor-cores 8
--executor-memory 24G

Do not do this:

#SBATCH --cpus-per-task=8

and then:

--executor-cores 32

That asks Spark to use more CPU than Slurm gave the job.

It may still run.

It is still wrong.

Leave memory headroom

If Slurm gives each node 32 GB:

#SBATCH --mem=32G

Do not give all 32 GB to Spark:

--executor-memory 32G

Use less:

--executor-memory 24G

Spark needs overhead.

The JVM needs memory.

Python needs memory.

The OS needs memory.

A simple rule:

executor memory = around 70-80% of Slurm memory

This avoids many avoidable crashes.

Do not use SSH startup first

Many Spark guides use:

start-all.sh

That usually depends on SSH.

For Slurm, start with srun.

Use this:

srun -w "$node" "$SPARK_HOME/sbin/start-worker.sh" "$master_url"

Avoid this:

ssh "$node" "$SPARK_HOME/sbin/start-worker.sh" "$master_url"

ssh can work, but it escapes the Slurm mental model.

srun keeps the process inside the allocation.

That is cleaner.

When raw Slurm is still better

Use raw Slurm when the work is independent.

Good examples:

FastQC per sample
BWA per sample
samtools per BAM
independent simulations
image conversion
many small file-by-file jobs
simple batch commands

For these, use:

#SBATCH --array=1-100

Do not force Spark into this.

Spark adds startup time, JVM overhead, logs, ports, and more moving pieces.

That overhead is only worth it when Spark is solving a real data-distribution problem.

When Postgres is still better

Spark is not automatically better than Postgres either.

Use Postgres when the data fits well on one database server and you need:

indexed SQL queries
transactions
data integrity
frequent interactive queries
many small lookups
application serving
relational constraints

A tuned Postgres box can beat a small Spark cluster for many SQL workloads.

Do not use Spark just because the dataset feels big.

Use Spark when the data already lives as many files on shared storage, or when one job needs to scan, join, aggregate, and transform data across multiple nodes.

Postgres is a database.

Spark is a distributed compute engine.

Different tools.

When Spark on Slurm is better

Use Spark on Slurm when the workload needs coordinated data processing.

Good examples:

large Parquet processing
large table processing
joins across big datasets
group-by and aggregation
repeated filtering and transformation
distributed feature generation
data preparation for ML
workloads that benefit from caching
jobs where task retry matters

This is where raw sbatch starts to feel awkward.

Spark gives you the distributed data engine.

Slurm gives you the machines.

Shared storage gives all workers the same data path.

That combination is the point.

NFS limits

NFS is fine for a small tutorial cluster.

But NFS is not magic.

It can become the bottleneck when:

many nodes read heavily
many nodes write heavily
there are many small files
metadata operations are high
shuffle output is large
the cluster grows

If storage becomes the bottleneck, move to something built for this:

Lustre
BeeGFS
GPFS
Ceph
S3 / MinIO
HDFS

Start simple.

But do not pretend NFS is the final answer for every cluster.

Common problems

Nodes are down in Slurm

Check:

sinfo
scontrol show node node1
sudo journalctl -u slurmd -n 100
sudo journalctl -u slurmctld -n 100

Common causes:

hostname mismatch
bad /etc/hosts
Munge not running
wrong Munge key
wrong CPU or memory values in slurm.conf
firewall issues

Fix Slurm first.

Multi-node Slurm test does not show all nodes

Check the job request:

#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1

Check the partition:

sinfo

Check that nodes are idle or available.

Spark will not run multi-node if Slurm is only giving one node.

NFS mount is missing on one node

Check:

srun ls /shared

If one node fails, fix /etc/fstab or the NFS mount on that node.

Also check:

showmount -e master
sudo mount -a

Spark workers need the same paths.

Spark workers do not connect to master

Check that nodes can resolve the master node name:

srun getent hosts "$master_node"

Check common Spark ports:

  Spark master
  Spark master web UI
  Spark worker web UI

If the firewall blocks node-to-node traffic, workers may not connect.

Spark only runs locally

Do not use:

spark-submit --master local[*] job.py

Use:

spark-submit --master "$master_url" job.py

Also check that workers actually started.

Look in:

/shared/logs/spark-<jobid>

Spark cannot read input files

Check that all nodes can see the same input:

srun ls /shared/data

If the path exists only on one node, Spark will fail or behave strangely.

Use shared storage.

Keep the path identical on all nodes.

Spark runs out of memory

Reduce:

--executor-memory

Also reduce the amount of data per partition or increase the partition count.

For PySpark, remember:

JVM memory + Python memory + overhead

Do not size memory too close to the Slurm limit.

The job hangs during startup

Common causes:

Spark master not ready yet
workers cannot reach the master
firewall blocks Spark ports
wrong hostname
Java missing on one node
Spark path differs between nodes
/shared is not mounted on one node

Check:

cat spark-<jobid>.err
ls /shared/logs/spark-<jobid>

Then check Slurm logs if needed.

Optional GPU notes

You can run this on GPU nodes too.

But Spark does not magically use GPUs.

Slurm can allocate GPUs:

#SBATCH --partition=gpu
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-node=2
#SBATCH --mem=64G

Spark can be told about GPU resources:

spark-submit \
    --master "$master_url" \
    --conf spark.executor.resource.gpu.amount=2 \
    --conf spark.task.resource.gpu.amount=1 \
    --executor-cores 16 \
    --executor-memory 48G \
    ./gpu_job.py

But the application must actually use CUDA.

Check GPU visibility:

srun nvidia-smi

For PyTorch:

import torch

print(torch.cuda.is_available())
print(torch.cuda.device_count())

If GPUs are not visible, fix Slurm and NVIDIA first.

Spark is not the first thing to blame.

Final shape

The whole setup is just this:

Slurm cluster works.
Multi-node Slurm job works.
Shared storage works.
Spark is installed on all nodes.
Slurm allocates nodes.
Spark starts inside the allocation.
Spark reads and writes shared data.
Spark runs the data job.
Spark stops.
Slurm releases the nodes.

Use raw Slurm for independent jobs.

Use Postgres for database-shaped problems.

Use Spark on Slurm for coordinated data jobs over shared datasets.

That is the practical boundary.

Keep that boundary clear and the setup stays simple.