Multiple programs on one node
Using srun to create multiple job steps
You can use srun to start multiple job steps concurrently on a single node, e.g. if your job is not big enough to fill a whole node. There are a few details to keep in mind:
- By default, the srun command gets exclusive access to all resources of the job allocation and uses all tasks. You therefore need to limit srun to only use part of the allocation. This includes implicitly granted resources, i.e. memory and GPUs, so the --exact flag is needed.
- If running non-MPI programs, use the -c option to set the number of cores each process should have access to.
- srun waits for the program to finish, so you need to start concurrent processes in the background.

Good default memory per CPU values (without hyperthreading) usually are:
|                 | standard96 | large96 | huge96 | large40/gpu |
|-----------------|------------|---------|--------|-------------|
| --mem-per-cpu   | 3770M      | 7781M   | 15854M | 19075M      |
Examples
```bash
#!/bin/bash

#SBATCH -p standard96
#SBATCH -t 06:00:00
#SBATCH -N 1

srun --exact -n1 -c 10 --mem-per-cpu 3770M ./program1 &
srun --exact -n1 -c 80 --mem-per-cpu 3770M ./program2 &
srun --exact -n1 -c 6 --mem-per-cpu 3770M ./program3 &
wait
```
```bash
#!/bin/bash

#SBATCH -p gpu
#SBATCH -t 12:00:00
#SBATCH -N 1

srun --exact -n1 -c 10 -G1 --mem-per-cpu 19075M ./single-gpu-program &
srun --exact -n1 -c 10 -G1 --mem-per-cpu 19075M ./single-gpu-program &
srun --exact -n1 -c 10 -G1 --mem-per-cpu 19075M ./single-gpu-program &
srun --exact -n1 -c 10 -G1 --mem-per-cpu 19075M ./single-gpu-program &
wait
```
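To check that the programs really run side by side as separate job steps within one job, you can inspect the steps of the running or finished job. A minimal sketch, where the job ID 12345 is a made-up placeholder:

```bash
squeue -s                                                       # list currently running job steps
sacct -j 12345 --format=JobID,JobName,AllocCPUS,Elapsed,State   # per-step accounting for job 12345
```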
Using the Linux parallel command to run a large number of tasks
If you have to run many nearly identical but small tasks (single-core, little memory) you can try to use the Linux parallel command. To use this approach, you first need to write a bash shell script, e.g. task.sh, which executes a single task. As an example we will use the following script:
```bash
#!/bin/bash

# parallel task
TASK_ID=$1
PARAMETER=$((10+RANDOM%10))   # determine some parameter unique for this task
                              # often this will depend on the TASK_ID

echo -n "Task $TASK_ID: sleeping for $PARAMETER seconds ... "
sleep $PARAMETER
echo "done"
```
This script simply defines a variable PARAMETER, which is then used as the input for the actual command, which is sleep in this case. The script also takes one input parameter, which can be interpreted as the TASK_ID and could also be used for determining the PARAMETER. If we make the script executable and run it as follows, we get:
```
$ chmod u+x task.sh
$ ./task.sh 4
Task 4: sleeping for 11 seconds ... done
```
To now run this task 100 times with different TASK_IDs we can write the following job script:
```bash
#!/bin/bash

#SBATCH --partition standard96:test   # adjust partition as needed
#SBATCH --nodes 1                     # more than 1 node can be used
#SBATCH --tasks-per-node 96           # one task per CPU core, adjust for partition

# set memory available per core
MEM_PER_CORE=4525   # must be set to a value that corresponds with the partition
# see https://www.hlrn.de/doc/display/PUB/Multiple+concurrent+programs+on+a+single+node

# Define srun arguments:
srun="srun -n1 -N1 --exclusive --mem-per-cpu $MEM_PER_CORE"
# --exclusive     ensures srun uses distinct CPUs for each job step
# -N1 -n1         allocates a single core to each task

# Define parallel arguments:
parallel="parallel -N 1 --delay .2 -j $SLURM_NTASKS --joblog parallel_job.log"
# -N               number of arguments you want to pass to the task script
# -j               number of parallel tasks (determined from resources provided by Slurm)
# --delay .2       prevents overloading the controlling node on short jobs
# --resume         add if needed to use the joblog to continue an interrupted run (job resubmitted)
# --joblog         creates a log file, required for resuming

# Run the tasks in parallel
$parallel "$srun ./task.sh {1}" ::: {1..100}
# task.sh          executable(!) script with the task to complete, may depend on some input parameter
# ::: {a..b}       range of parameters, alternatively $(seq 100) should also work
# {1}              parameter from the range is passed here, multiple parameters can be used with
#                  additional {i}, e.g. {2} {3} (refer to the parallel documentation)
```
The script uses parallel in line 25 to run task.sh 100 times with a parameter taken from the range {1..100}. Because each task is started with srun, a separate job step is created for it, and due to the options used with srun (see line 12) each task uses only a single core. This simple example can be adjusted as needed by modifying the script task.sh and the job script parallel_job.sh. You can adjust the requested resources; for example, you can use more than a single node (a sketch for this follows the submission command below). Note that, depending on the number of tasks, you may have to split your job into several jobs to keep the total time needed short enough. Once the setup is done, you can simply submit the job:
$ sbatch parallel_job.sh
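As a rough sketch of the multi-node case (assuming the standard96 partition from the example above), only the resource request in the header of parallel_job.sh needs to change; the rest of the script can stay the same:

```bash
#SBATCH --partition standard96   # full nodes instead of the :test partition
#SBATCH --nodes 2                # two full nodes
#SBATCH --tasks-per-node 96      # still one task per CPU core
# parallel -j $SLURM_NTASKS now keeps up to 2 x 96 = 192 job steps running at a time,
# and srun -n1 -N1 places each step on a free core somewhere in the allocation
```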
Looping over two arrays
You can use parallel to loop over multiple arrays. The --xapply option controls whether all combinations of the inputs are generated or whether the inputs are linked one-to-one.
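A minimal sketch using echo as a stand-in for a real program (inside a job script, the echo would be wrapped in an srun call as in the example above):

```bash
# default: all combinations of the two input lists are generated (3 x 2 = 6 runs)
parallel echo "{1} {2}" ::: a b c ::: 1 2

# --xapply: the input lists are linked one-to-one (3 runs: "a 1", "b 2", "c 3")
parallel --xapply echo "{1} {2}" ::: a b c ::: 1 2 3
```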
Doing local I/O tasks in parallel
To distribute data from a global location ($WORK, $HOME) to several nodes simultaneously - similar to an MPI_Bcast - one can use:
pdcp -r -w $SLURM_NODELIST $WORK/input2copy/* $LOCAL_TMPDIR
$LOCAL_TMPDIR exists only node-locally on each compute node (see Special Filesystems for more details).
To collect individual data from several node-local locations simultaneously - similar to an MPI_Gather - one can use:
rpdcp -r -w $SLURM_NODELIST $LOCAL_TMPDIR/output2keep/* $WORK/returndir
rpdcp automatically renames the collected files by appending the hostname of the node they came from. This avoids overwriting files with the same name.
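For illustration, with made-up node names n0001 and n0002 and a file result.dat in the output directory of each node, the collected files would end up as:

```bash
rpdcp -r -w $SLURM_NODELIST $LOCAL_TMPDIR/output2keep/* $WORK/returndir
# $WORK/returndir then contains one copy per node, e.g.
#   result.dat.n0001
#   result.dat.n0002
```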
In the next example, local output data ($LOCAL_TMPDIR) is moved back to $WORK while the main program is still running. With "&" the main program is sent to the background.
```bash
#!/bin/bash

#SBATCH --partition=cpu-genoa:test
#SBATCH --nodes=2
#SBATCH --tasks-per-node=3
#SBATCH --job-name=auto_loc_io_test
#SBATCH --output=%x.%j.out

echo ""
echo "Used environment variables:"
echo "WORK" $WORK
echo "LOCAL_TMPDIR" $LOCAL_TMPDIR   # Auto set by Slurm prolog: /local/$USER_$SLURM_JOB_ID

echo ""
echo "Used slurm variables:"
echo "SLURM_JOB_NAME" $SLURM_JOB_NAME
echo "SLURM_JOB_ID" $SLURM_JOB_ID
echo "SLURM_NODELIST" $SLURM_NODELIST
echo ""

### prepare case

# master dir. on the global filesystem
# can be created by the master node (= directly in this script)
MASTERDIR=$WORK/$SLURM_JOB_NAME"."$SLURM_JOB_ID
echo "All job data is (collected) here: $MASTERDIR"

# subdirectories of the master dir.
# can be created by the master node (= directly in this script)
INPUT_DIR=$MASTERDIR/in
OUTPUT_DIR=$MASTERDIR/allout
mkdir -p $INPUT_DIR
mkdir -p $OUTPUT_DIR
echo "example data inside input file" > $INPUT_DIR/input_data.dat

# $LOCAL_TMPDIR (aka /local/$USER_$SLURM_JOB_ID)
# is a dir. on the node-local ssd (or in ram if no ssd is present)
# and only exists during the job lifetime
# subdirectories of $LOCAL_TMPDIR
# can only be created node-locally (= "child" srun -w node)
LOC_OUT_DIR=$LOCAL_TMPDIR/out

# prepare script with an empty 10 second job
cat <<EOF > $INPUT_DIR/dummyjob.sh
#!/bin/bash
echo "Main job started on \$SLURMD_NODENAME in \$PWD."
sleep 10
mkdir -p $LOC_OUT_DIR
# important to use a unique local name in order to mv it to the global dir. without overwriting
hostname > $LOC_OUT_DIR/data.\$SLURMD_NODENAME
echo "\$SLURMD_NODENAME main job finished. Data written to $LOC_OUT_DIR."
EOF
chmod u+x $INPUT_DIR/dummyjob.sh

# command to mv data from the temporary node-local dir. to the global master dir.
# before the mv -> check if the dir. is present and not empty (= data to copy inside)
loc2mst_cmd_string='if [ -d '$LOC_OUT_DIR' ]; then if ! [ $(find '$LOC_OUT_DIR' -maxdepth 0 -empty) ]; then mv '$LOC_OUT_DIR'/* '$OUTPUT_DIR'; fi; fi'
# all ifs/checks are only needed to avoid warnings triggered for example if no data to mv is present

### cp input files to node-local directories (in parallel)
pdcp -r -w $SLURM_NODELIST $INPUT_DIR/* $LOCAL_TMPDIR
echo "Relevant data is copied to node-local locations. Main job is starting..."

### execute the main job itself: e.g. mpirun or srun
srun $LOCAL_TMPDIR/dummyjob.sh &
main_pid=$!

echo "Main job running. Start copying node-local data back to the global filesystem in the background - every three seconds..."
while ps -p $main_pid &>/dev/null; do
    echo "Capacity of \$LOCAL_TMPDIR is:"
    df -h $LOCAL_TMPDIR
    pdsh -w $SLURM_NODELIST "$loc2mst_cmd_string"
    echo "loc2mst cmd executed. New capacity of \$LOCAL_TMPDIR is:"
    df -h $LOCAL_TMPDIR
    sleep 3
done
wait

echo "Main job finished. Copying remaining node-local data (if any)."
pdsh -w $SLURM_NODELIST "$loc2mst_cmd_string"

echo "All (parallel) copy jobs are done. All data is here: $MASTERDIR"
```