
Preface

In September 2024, NHR@ZIB will replace the global file systems HOME and WORK on all systems of Lise in Berlin. This affects all login nodes and all compute partitions. Please be aware of the following activities.

  • For September 2024 we plan a 3-day maintenance to switch to the new HOME and WORK file systems.

  • All data in the HOME file system will be copied by NHR@ZIB. Users do not have to do anything for HOME.

  • The data in the WORK file system will not be copied automatically. Within a period of three weeks, users will have the opportunity to migrate data from the old WORK to the new WORK file system.

Current migration state

nodes                     CentOS 7       Rocky Linux 9
login                     blogin[1-6]    blogin[7-8]
compute (384 GB RAM)      832            112
compute (768 GB RAM)      32             0
compute (1536 GB RAM)     2              0

Time schedule, last updated July 15th

date          subject
2024-07-03    official start of the migration phase with 2 login and 112 compute nodes running Rocky Linux 9
Aug 26th      email update for the migration plan
Sept          3-day maintenance for the replacement of HOME and WORK
Sept/Oct      3-week period to copy data from the old to the new WORK

What has changed

SLURM partitions

old partition name (CentOS 7)    new partition name (Rocky Linux 9)    current job limits
standard96                       cpu-clx                               40 nodes, 12 h wall time
standard96:test                  cpu-clx:test                          16 nodes, 1 h wall time
standard96:ssd                   cpu-clx:ssd
large96                          cpu-clx:large
large96:test
large96:shared
huge96                           cpu-clx:huge

(empty cells: closed / not available yet)
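
For illustration, a minimal job script header reflecting the partition rename; the node count, wall time, and the binary name mybinary are placeholders, not recommended values:

    #!/bin/bash
    #SBATCH --partition=cpu-clx      # formerly: standard96 (use cpu-clx:test for short test jobs)
    #SBATCH --nodes=2                # placeholder values
    #SBATCH --time=01:00:00
    srun ./mybinary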

Software and environment modules

                              CentOS 7             Rocky Linux 9
OS components                 glibc 2.17           glibc 2.34
                              Python 3.6           Python 3.9
                              GCC 4.8              GCC 11.4
                              bash 4.2             bash 5.1
check disk quota              hlrnquota            show-quota
Environment modules version   4.8 (Tmod)           5.4 (Tmod)
Modules loaded initially      HLRNenv              NHRZIBenv
                              slurm                slurm
                              sw.skl               sw.clx.el9
compiler modules              intel ≤ 2022.2.1     intel ≥ 2024.2
                              gcc ≤ 13.2.0         gcc ≥ 13.3.0
MPI modules                   impi ≤ 2021.7.1      impi ≥ 2021.13
                              openmpi ≤ 4.1.4      openmpi ≥ 5.0.3
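
As a sketch of a typical interactive session in the new environment (module versions follow the table above; the exact module names available on the system may differ):

    # on a Rocky Linux 9 login node, NHRZIBenv, slurm, and sw.clx.el9 are loaded initially
    module avail intel                        # list the Intel compiler modules provided
    module load intel/2024.2 impi/2021.13     # load new compiler and MPI modules (version names assumed)
    show-quota                                # replaces the former hlrnquota command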

Shell environment variables

CentOS 7                       Rocky Linux 9
TMPDIR=/scratch/tmp/$USER      (undefined, local /tmp is used)
SLURM_MPI_TYPE=pmi2            SLURM_MPI_TYPE=pmix
(undefined)                    I_MPI_ROOT=/sw/comm/impi/mpi/latest
                               I_MPI_PMI_LIBRARY=<path-to>/libpmix.so
(undefined)                    NHRZIB_ARCH=clx
                               NHRZIB_OS=el9
                               NHRZIB_TARGET=clx.el9
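
A short sketch of how a job script might react to the changed variables; the scratch path below is a placeholder, not an official location:

    # TMPDIR is no longer pre-set and defaults to the node-local /tmp;
    # set it explicitly if a large scratch area is needed (placeholder path):
    export TMPDIR=/path/to/your/work/tmp
    mkdir -p "$TMPDIR"

    # the new NHRZIB_* variables can guard architecture-specific settings:
    if [ "$NHRZIB_TARGET" = "clx.el9" ]; then
        echo "running in the migrated Rocky Linux 9 environment"
    fi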

What remains unchanged

  • node hardware and node names

  • communication network (Intel Omnipath)

  • file systems (HOME, WORK, PERM) and disk quotas

  • environment modules system (still based on Tcl, a.k.a. “Tmod”)

  • access credentials (user IDs, SSH keys) and project IDs

  • charge rates and CPU time accounting (early migrators' jobs are free of charge)

  • Lise’s Nvidia-A100 and Intel-PVC partitions

Special remarks

  • For users of SLURM’s srun job launcher:
    Open MPI 5.x has dropped support for the PMI-2 API; it solely depends on PMIx to bootstrap MPI processes. For this reason, the environment setting was changed from SLURM_MPI_TYPE=pmi2 to SLURM_MPI_TYPE=pmix, so binaries linked against Open MPI can be started as usual "out of the box" using srun mybinary. This also works for binaries linked against Intel MPI, provided a recent version (≥ 2021.11) of Intel MPI was used. If an older version of Intel MPI was used and relinking/recompiling is not possible, one can follow the workaround for PMI-2 with srun as described in the Q&A section below. Switching from srun to mpirun should also be considered (see the sketch after this list).

  • Using more processes per node than available physical cores (PPN > 96; hyperthreads) with the OPX provider:
    The OPX provider currently does not support hyperthreads/PPN > 96 on the clx partitions. Doing so may result in segmentation faults in libfabric during process startup. If a high number of processes per node is really required, the libfabric provider has to be changed to PSM2 by setting FI_PROVIDER=psm2 (see the sketch after this list). Note that the usage of hyperthreads may not be advisable; we encourage users to test performance before using more threads than there are physical cores.
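
The following sketch illustrates both remarks; mybinary stands for your own executable, and the process counts are placeholders:

    # PMIx is the default on Rocky Linux 9; Open MPI 5.x and recent Intel MPI (>= 2021.11)
    # binaries start "out of the box":
    srun ./mybinary

    # only if more than 96 processes per node (hyperthreads) are really required,
    # switch the libfabric provider from OPX to PSM2 first:
    export FI_PROVIDER=psm2
    srun --ntasks-per-node=192 ./mybinary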

Action items for users

All users of Lise are advised to

  • log in to an already migrated login node (see the current state table) and get familiar with the new environment

  • check self-compiled software for continued operability

  • relink/recompile software as needed

  • adapt and test job scripts and workflows

  • submit test jobs to the new "cpu-clx:test" SLURM partition

  • read the Q&A section and ask for support in case of further questions, problems, or software requests (support@nhr.zib.de)

Questions and answers

I cannot find the "mycode/1.2.3" environment module anymore. I still need it. What happened to it?

There can be several reasons. Maybe our installation of Rocky Linux 9 already includes the "mycode" package in a version newer than "1.2.3". Or maybe we provide an updated environment module "mycode/2.0" instead. Or maybe we have not yet considered continuing "mycode" under Rocky Linux 9; in this case, please submit a support request.

Did you delete all the user software you have provided under CentOS 7?

No. Though environment modules prepared under CentOS 7 might not be available anymore, the actual software they were pointing at is still available under the "/sw" file system.

Is there a way to switch back from PMIx to PMI-2 with srun?

Yes. Simply unset I_MPI_PMI_LIBRARY and export SLURM_MPI_TYPE=pmi2 before invoking srun, as shown below. (Be prepared for problems arising from this for binaries linked against Open MPI 5.x.)
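
For example (mybinary stands for your own executable):

    unset I_MPI_PMI_LIBRARY
    export SLURM_MPI_TYPE=pmi2
    srun ./mybinary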

Why do I receive "MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found"?

This is because we define I_MPI_PMI_LIBRARY to ensure that the PMIx interface of srun also works with Intel MPI. If you prefer mpirun over srun, you can manually unset I_MPI_PMI_LIBRARY to avoid this (harmless) warning message, as shown below.
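
For example (mybinary stands for your own executable):

    unset I_MPI_PMI_LIBRARY
    mpirun ./mybinary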

I have loaded the "intel/2024.2" environment module, but neither the icc nor the icpc compiler is found. Why is that?

Starting with the 2022.2 release of Intel's oneAPI toolkits, the icc and icpc "classic" C/C++ compilers have been marked as "deprecated", see here. The corresponding user warning reads:

icc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.
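
If a build still calls the classic drivers, one way forward is to switch to the LLVM-based oneAPI compilers recommended in the warning above; the source file names and flags below are placeholders:

    # instead of the deprecated classic drivers
    #   icc  -O2 -o myprog myprog.c
    #   icpc -O2 -o myprog myprog.cpp
    # use the LLVM-based oneAPI compilers:
    icx  -O2 -o myprog myprog.c
    icpx -O2 -o myprog myprog.cpp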


Data migration for WORK

Data migration for the WORK file system follows these steps.

  • Data in the WORK file system will not be copied automatically by NHR@ZIB.

  • Within a period of three weeks, a user can migrate data from the old WORK to the new WORK file system (see the sketch below for one possible way to copy data).
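
One possible way to copy data once both file systems are mounted is rsync; the paths below are placeholders, not the actual mount points:

    # dry run first to see what would be transferred (placeholder paths)
    rsync -avn /old_work/myproject/ /new_work/myproject/
    # then copy for real, preserving permissions and timestamps
    rsync -av /old_work/myproject/ /new_work/myproject/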