PVC MPI Usage

On the PVC partition, there are essentially two options for running parallel workloads on the GPUs:

  1. Using Slurm

  2. Using Intel MPI

Please note that Slurm only manages physical Intel GPUs. Whether the two tiles/stacks of such a GPU can be used individually depends on the settings of the Level Zero runtime; for details, refer to the oneAPI documentation. For use cases in which the two tiles/stacks are to be used individually, Intel MPI is recommended.
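
With recent oneAPI runtimes, the Level Zero device hierarchy is typically controlled via the ZE_FLAT_DEVICE_HIERARCHY environment variable. The following is only a sketch; the authoritative values and defaults are described in the oneAPI documentation:

# Illustration only -- check the oneAPI documentation for your runtime version.
# COMPOSITE: each physical GPU appears as one device with its tiles/stacks as subdevices
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
# FLAT: each tile/stack appears as a separate root device
# export ZE_FLAT_DEVICE_HIERARCHY=FLAT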

Slurm

Our Slurm installation natively supports the Intel PVC GPUs. Use resource specifications such as --gpus-per-task or --gpus and the corresponding directives for sbatch / srun to request the number of GPUs you want to use. Refer to the Slurm documentation for details.
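
As a minimal sketch (partition name taken from the batch example below, application name hypothetical), an interactive launch with one GPU per task could look like this:

# 4 tasks on one node, one PVC GPU assigned to each task
srun --partition=gpu-pvc --nodes=1 --ntasks=4 --gpus-per-task=1 ./application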

To check whether the intended GPUs have been assigned to the individual processes, invoke xpu-smi discovery and look for the device file that is assigned, e.g.:

#!/bin/bash
#SBATCH --partition=gpu-pvc
#SBATCH --nodes=2
#SBATCH --gpus-per-task=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=pin-check

srun --label hostname | sort -n
srun --label xpu-smi discovery | grep 'Device:' | sort -n

This should result in output like the following:

0: bgi1002
1: bgi1002
2: bgi1002
3: bgi1002
4: bgi1008
5: bgi1008
6: bgi1008
7: bgi1008
0: |        | DRM Device: /dev/dri/card0 |
1: |        | DRM Device: /dev/dri/card2 |
2: |        | DRM Device: /dev/dri/card1 |
3: |        | DRM Device: /dev/dri/card3 |
4: |        | DRM Device: /dev/dri/card0 |
5: |        | DRM Device: /dev/dri/card2 |
6: |        | DRM Device: /dev/dri/card1 |
7: |        | DRM Device: /dev/dri/card3 |

Intel MPI

In addition to Slurm, Intel MPI can be used for running parallel multi-GPU workloads.

To make use of Intel MPI, load an impi environment module to make the library available.

Note that Slurm’s Intel GPU support interferes with Intel MPI’s GPU pinning. To work around this, ensure that the environment variable ZE_AFFINITY_MASK is not set for the processes launched by Intel MPI. To achieve this, switch the process bootstrap mechanism to something other than slurm (ssh in the example below) and do not pass any additional arguments to the process bootstrap.
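
In a job script this corresponds to the following settings (they appear again in the full example below):

# make sure Slurm's GPU pinning does not leak into the Intel MPI ranks
unset ZE_AFFINITY_MASK
unset I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
# use a bootstrap mechanism other than slurm, e.g. ssh
export I_MPI_HYDRA_BOOTSTRAP=ssh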

Further, to enable GPU support, set the environment variable I_MPI_OFFLOAD to "1" (in your job script). If you use GPUs on multiple nodes, it is strongly recommended to use the psm3 libfabric provider (FI_PROVIDER=psm3).

Depending on your application’s needs, set I_MPI_OFFLOAD_CELL to either tile or device to assign each MPI rank either a tile or the whole GPU device.
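
The corresponding environment settings, as also used in the example job script below:

# enable GPU support in Intel MPI
export I_MPI_OFFLOAD=1
# recommended libfabric provider for multi-node GPU jobs
export FI_PROVIDER=psm3
# assign each rank one tile/stack of a GPU ...
export I_MPI_OFFLOAD_CELL=tile
# ... or, alternatively, a whole GPU device
# export I_MPI_OFFLOAD_CELL=device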

Again, it is recommended to check the pinning by setting I_MPI_DEBUG to (at least) 3 and I_MPI_OFFLOAD_PRINT_TOPOLOGY to 1.

Refer to the Intel MPI documentation on GPU support for further information.

Example Job Script:

#!/bin/bash
# example using 2 x (2 x 4) = 16 MPI processes, each assigned
# to one of the two tiles (stacks) of a PVC GPU
#SBATCH --partition=gpu-pvc
#SBATCH --nodes=2
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=8
#SBATCH --job-name=pin-check

# required for usage of Intel GPUs
module load intel
# required for Intel MPI
module load impi/2021.15

# workaround to account for Slurm's GPU support, which interferes with Intel MPI's
unset ZE_AFFINITY_MASK
unset I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
export I_MPI_HYDRA_BOOTSTRAP=ssh

# required for GPU usage with MPI
export FI_PROVIDER=psm3

# enable GPU support in Intel MPI
export I_MPI_OFFLOAD=1

# assign each rank a tile of a GPU
export I_MPI_OFFLOAD_CELL=tile

# for checking the process pinning
export I_MPI_DEBUG=3
export I_MPI_OFFLOAD_PRINT_TOPOLOGY=1

mpirun ./application

The resulting pinning should look like this (and be identical on the second node). Please consider a pinning strategy that matches your application’s needs.

[0] MPI startup(): ===== GPU topology on bgi1002 =====
[0] MPI startup(): NUMA nodes     : 2
[0] MPI startup(): GPUs           : 4
[0] MPI startup(): Stacks (Tiles) : 8
[0] MPI startup(): Hierarchy mode : flat
[0] MPI startup(): Backend        : level zero
[0] MPI startup(): Device name    : Intel(R) Data Center GPU Max 1550
[0] MPI startup(): NUMA Id  GPU Id  Stacks (tiles)  Ranks on this NUMA
[0] MPI startup(): 0        0,1     (0,1)(2,3)      0,1,2,3
[0] MPI startup(): 1        2,3     (4,5)(6,7)      4,5,6,7
[0] MPI startup(): ===== GPU pinning on bgi1002 =====
[0] MPI startup(): Rank  Pin stack (tile)
[0] MPI startup(): 0     {0}
[0] MPI startup(): 1     {1}
[0] MPI startup(): 2     {2}
[0] MPI startup(): 3     {3}
[0] MPI startup(): 4     {4}
[0] MPI startup(): 5     {5}
[0] MPI startup(): 6     {6}
[0] MPI startup(): 7     {7}
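
If the default assignment does not match your application’s needs, Intel MPI also allows an explicit mapping of ranks to tiles/stacks. The following is only a sketch assuming the I_MPI_OFFLOAD_CELL_LIST variable described in the Intel MPI GPU pinning documentation; verify the exact syntax for your Intel MPI version:

# hypothetical custom pinning: assign the 8 local ranks to stacks in an explicit order
export I_MPI_OFFLOAD_CELL_LIST=0,2,4,6,1,3,5,7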