GPU A100 partition

General information on the usage of the GPU A100 partition can be found in the Quickstart guide, which covers these topics for all Lise partitions. The topics below are specific to the GPU A100 partition.

Hardware

The GPU A100 partition offers access to two login nodes and 42 compute nodes equipped with Nvidia A100 GPUs. Each compute node has the following properties.

  • 2x Intel Xeon "Ice Lake" Platinum 8360Y (36 cores per socket, 2.4 GHz, 250 W)

  • 1 TB RAM (DDR4-3200)
  • 4x Nvidia A100 (80 GB HBM2, SXM), two attached to each CPU socket (see the topology sketch after this list)
  • 7.68 TB NVMe local SSD
  • 200 Gbit/s InfiniBand adapter (Mellanox MT28908)
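
Since two GPUs are attached to each CPU socket, process placement benefits from knowing the node topology. A minimal sketch for inspecting it (the node name bgn1007 is illustrative; nvidia-smi is only available on the compute nodes, not on the login nodes):

Example: Inspect the GPU/CPU topology of a compute node
bgn1007 $ nvidia-smi topo -m
...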

Login nodes

The hardware of the login nodes is similar to that of the compute nodes. Notable differences to the compute nodes are

  • reduced main memory (512 GB instead of 1 TB RAM) and
  • no GPUs and no CUDA drivers.

Login authentication is possible via SSH keys only; please visit the Usage Guide for details. A login sketch follows the table below.

Generic login name     List of login nodes
bgnlogin.nhr.zib.de    bgnlogin1.nhr.zib.de, bgnlogin2.nhr.zib.de
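
A minimal login sketch (assuming your SSH public key is already registered as described in the Usage Guide; myaccount is a placeholder):

Example: SSH login
localhost $ ssh myaccount@bgnlogin.nhr.zib.de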

Software and environment modules

  • Login and compute nodes of the A100 GPU partition run Rocky Linux (currently version 8.6).
  • Software for the A100 GPU partition provided by NHR@ZIB can be found using the module command; see the Usage Guide.
  • Please note the sw.a100 environment module, which controls the software selection for the GPU A100 partition.
Example: Show the currently available software and access compilers
bgnlogin1 $ module avail
...
bgnlogin1 $ module load gcc
...
bgnlogin1 $ module list
Currently Loaded Modulefiles:
 1) HLRNenv   2) sw.a100   3) slurm   4) gcc/11.3.0(default)

Program build and execution

  • Each node of the GPU A100 system combines a host CPU with its four attached GPUs. A wide range of software is available to support this hardware.
  • We recommend using the GPU A100 login nodes for program builds. If a build requires the presence of CUDA drivers, compilation is also possible on a compute node within a Slurm job session. For build examples, please visit our manual.
  • GPU-aware MPI: For efficient use of MPI-distributed GPU codes, a CUDA-aware installation of Open MPI is available via the openmpi/gcc.11/4.1.4 environment module. Open MPI respects the resource requests made to Slurm, so no special arguments to mpiexec/mpirun are required. Nevertheless, please check that your application is correctly bound to CPU cores and GPUs; use the --report-bindings option of mpiexec/mpirun to verify, as shown in the sketch after this list.
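
A minimal sketch for verifying the bindings in an interactive Slurm session (the binary mycode.bin is a placeholder; the srun options mirror the interactive example in the Container section below):

Example: Check CPU and GPU bindings
bgnlogin1 $ srun -pgpu-a100 --gres=gpu:4 --nodes=1 --ntasks=4 --pty --interactive ${SHELL}
bgn1007 $ module load openmpi/gcc.11/4.1.4
bgn1007 $ mpirun --report-bindings ./mycode.bin
...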

Job monitoring

A running job can be monitored interactively, directly on each of its compute nodes. Once you know the names of the job's nodes, you can log in and monitor the host CPU as well as the GPUs.

Example: Job monitoring
bgnlogin1 $ squeue -u myaccount
  JOBID PARTITION     NAME      USER ST TIME  NODES NODELIST(REASON)
7748370  gpu-a100 a100_mpi myaccount  R 1:23      2 bgn[1007,1017]
bgnlogin1 $ ssh bgn1007
bgn1007 $ top
bgn1007 $ nvidia-smi
bgn1007 $ module load nvtop
bgn1007 $ nvtop


Using the Slurm batch system

...

Example: GPU job script
#!/bin/bash
#SBATCH --partition=gpu-a100      # GPU A100 partition
#SBATCH --nodes=2                 # two compute nodes
#SBATCH --ntasks=8                # 4 MPI tasks per node, one per GPU
#SBATCH --gres=gpu:4              # request all 4 GPUs of each node

module load openmpi/gcc.11/4.1.4
mpirun ./mycode.bin
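
Assuming the script above is saved as a100_mpi.slurm (the file name is a placeholder), it can be submitted from a login node with sbatch:

Example: Job submission
bgnlogin1 $ sbatch a100_mpi.slurm
...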

Container

Apptainer is provided as a module and can be used to download, build, and run containers, e.g. Nvidia containers:

Apptainer example
bgnlogin1 ~ $ module load apptainer
Module for Apptainer 1.1.6 loaded.

# pull a TensorFlow image from nvcr.io - it must be compatible with the local driver
bgnlogin1 ~ $ apptainer pull tensorflow-22.01-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:22.01-tf2-py3
...

# example: single-node run calling Python from the container in an interactive job using 4 GPUs
bgnlogin1 ~ $ srun -pgpu-a100 --gres=gpu:4 --nodes=1 --pty --interactive --preserve-env ${SHELL}
...
bgn1003 ~ $ apptainer run --nv tensorflow-22.01-tf2-py3.sif python
...
Python 3.8.10 (default, Nov 26 2021, 20:14:08) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.config.list_physical_devices("GPU")
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]

# optional: clean up the Apptainer cache
bgnlogin1 ~ $ apptainer cache list
...
bgnlogin1 ~ $ apptainer cache clean