    Operating system migration from CentOS to Rocky Linux

    Updated Nov. 08, 2024

    • 1 Preface
    • 2 Current migration state: complete
    • 3 Latest news
    • 4 What has changed
      • 4.1 SLURM partitions
      • 4.2 Software and environment modules
      • 4.3 Shell environment variables
    • 5 What remains unchanged
    • 6 Special remarks
    • 7 Action items for users
    • 8 Questions and answers

    Preface

    The operating system “CentOS 7” has reached its end of life. For this reason, Lise’s CPU partition has been updated to “Rocky Linux 9”. This affects all login and compute nodes equipped with Intel Xeon Cascade Lake processors ("clx" for short). Lise’s GPU-A100 and GPU-PVC partitions are not affected.

    It is important for users to follow the action items specified below. Rocky Linux 9 introduces new versions of various system tools and libraries. Some codes compiled earlier under CentOS 7 might no longer work under Rocky Linux 9. Thus, legacy versions of environment modules offered under CentOS 7 were not transferred to the new OS environment or have been replaced by more recent versions.

    The migration to the new OS was organised in three consecutive phases and was completed by the end of September 2024.

    1. The first phase started with 2 login nodes and 112 compute nodes already migrated to Rocky Linux 9 for testing. The other nodes remained available under CentOS 7 for continued production.

    2. After the test phase, a major fraction of nodes was switched to Rocky Linux 9 to allow for general job production under the new OS.

    3. During the last phase, only a few nodes remained under CentOS 7. At the very end, they were migrated to Rocky Linux 9, too.

    During the migration phase, the use of Rocky Linux 9 "clx" compute nodes was free of charge.

    Current migration state: complete

    nodes                  | CentOS 7 | Rocky Linux 9
    login                  | -        | blogin[1-8]
    compute (384 GB RAM)   | -        | 948
    compute (768 GB RAM)   | -        | 32
    compute (1536 GB RAM)  | -        | 2

    Latest news

    date       | subject
    2024-09-30 | migration of remaining nodes from CentOS 7 to Rocky Linux 9
    2024-09-16 | generic login name “blogin” resolves to blogin[3-6]
    2024-08-14 | migration of blogin[3-6]
    2024-07-30 | migration of another 576 standard compute nodes to Rocky Linux 9
    2024-07-03 | official start of the migration phase with 2 login and 112 compute nodes running Rocky Linux 9

    What has changed

    SLURM partitions

    old partition name (CentOS 7) | new partition name (Rocky Linux 9) | current job limits
    ● standard96                  | ● cpu-clx                          | 512 nodes, 12 h wall time
    ● standard96:test             | ● cpu-clx:test                     | 16 nodes, 1 h wall time
    ● standard96:ssd              | ● cpu-clx:ssd                      | 50 nodes, 12 h wall time
    ● large96                     | ● cpu-clx:large                    | 32 nodes, 48 h wall time
    ● large96:test                |                                    |
    ● large96:shared              |                                    |
    ● huge96                      | ● cpu-clx:huge                     | 1 node, 48 h wall time

    ( ● available ● closed/not available yet )

    Jobs submitted without a partition name are placed in the default partition. The old default was standard96, the new default is cpu-clx.
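
    For example, an existing job script only needs its partition line adapted. The sketch below is a minimal illustration; the binary name “mybinary” and all resource values are placeholders:

        #!/bin/bash
        #SBATCH --partition=cpu-clx      # formerly: --partition=standard96
        #SBATCH --nodes=2
        #SBATCH --ntasks-per-node=96
        #SBATCH --time=01:00:00

        srun ./mybinary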

    Software and environment modules

     

                                | CentOS 7         | Rocky Linux 9
    OS components               | glibc 2.17       | glibc 2.34
                                | Python 3.6       | Python 3.9
                                | GCC 4.8          | GCC 11.4
                                | bash 4.2         | bash 5.1
    check disk quota            | hlrnquota        | show-quota
    Environment modules version | 4.8 (Tmod)       | 5.4 (Tmod)
    Modules loaded initially    | HLRNenv          | NHRZIBenv
                                | slurm            | slurm
                                | sw.skl           | sw.clx.el9
    compiler modules            | intel ≤ 2022.2.1 | intel ≥ 2024.2
                                | gcc ≤ 13.2.0     | gcc ≥ 13.3.0
    MPI modules                 | impi ≤ 2021.7.1  | impi ≥ 2021.13
                                | openmpi ≤ 4.1.4  | openmpi ≥ 5.0.3
    …                           | …                | …
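
    To get familiar with the new module tree, a short session on a Rocky Linux 9 login node could look like the following sketch; the version strings are examples taken from the table above, please verify them with module avail:

        module avail                            # list modules provided under Rocky Linux 9
        module list                             # initially loaded: NHRZIBenv, slurm, sw.clx.el9
        module load intel/2024.2 impi/2021.13   # example versions of the new compiler and MPI modules
        show-quota                              # new command replacing hlrnquota for disk quota checks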

    Shell environment variables

    CentOS 7                  | Rocky Linux 9
    TMPDIR=/scratch/tmp/$USER | (undefined, local /tmp is used)
    SLURM_MPI_TYPE=pmi2       | SLURM_MPI_TYPE=pmix
    (undefined)               | I_MPI_ROOT=/sw/comm/impi/mpi/latest
                              | I_MPI_PMI_LIBRARY=<path-to>/libpmix.so
    (undefined)               | NHRZIB_ARCH=clx
                              | NHRZIB_OS=el9
                              | NHRZIB_TARGET=clx.el9
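
    Because TMPDIR is no longer pre-defined, jobs that relied on it pointing to /scratch may want to set their own scratch directory. The sketch below (intended for use inside a job script) uses an example path, not a site default; adapt it to your project:

        echo $NHRZIB_ARCH $NHRZIB_OS $NHRZIB_TARGET      # expected output: clx el9 clx.el9

        # TMPDIR is undefined, so node-local /tmp is used unless you set it yourself:
        export TMPDIR=/scratch/tmp/$USER/$SLURM_JOB_ID   # example path only
        mkdir -p "$TMPDIR"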

    What remains unchanged

    • node hardware and node names

    • communication network (Intel Omnipath)

    • file systems (HOME, WORK, PERM) and disk quotas

    • environment modules system (still based on Tcl, a.k.a. “Tmod”)

    • access credentials (user IDs, SSH keys) and project IDs

    • charge rates and CPU time accounting (early migrators' jobs were free of charge)

    • Lise’s Nvidia-A100 and Intel-PVC partitions

    Special remarks

    • For users of SLURM’s srun job launcher:
      Open MPI 5.x has dropped support for the PMI-2 API and solely depends on PMIx to bootstrap MPI processes. For this reason the environment setting was changed from SLURM_MPI_TYPE=pmi2 to SLURM_MPI_TYPE=pmix, so binaries linked against Open MPI can be started as usual “out of the box” using srun mybinary. The same works for binaries linked against Intel-MPI, provided a recent version (≥ 2021.11) of Intel-MPI has been used. If an older version of Intel-MPI was used and relinking/recompiling is not possible, one can follow the workaround for PMI-2 with srun described in the Q&A section below, or consider switching from srun to mpirun.

    • Using more processes per node than available physical cores (PPN > 96; hyperthreads) when defining FI_PROVIDER=opx:
      The OPX provider currently does not support hyperthreads/PPN > 96 on the clx partitions. Doing so may result in segmentation faults in libfabric during process startup. If a high number of PPN is really required, the libfabric provider has to be changed back to PSM2 by re-defining FI_PROVIDER=psm2 (which is the default setting). Note that the use of hyperthreads may not be advisable; we encourage users to test performance before using more processes per node than there are physical cores. Also note that Open MPI’s mpirun/mpiexec defaults to using all hyperthreads if a Slurm job/allocation does not explicitly set --ntasks-per-node (or similar options). A short sketch illustrating these settings follows after this list.
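
    As a minimal sketch combining the remarks above (binary name and resource values are placeholders):

        #!/bin/bash
        #SBATCH --partition=cpu-clx
        #SBATCH --nodes=2
        #SBATCH --ntasks-per-node=96   # stay within the 96 physical cores per node

        # SLURM_MPI_TYPE=pmix is pre-set, so binaries linked against Open MPI 5.x
        # or a recent Intel-MPI (≥ 2021.11) start directly with srun:
        srun ./mybinary

        # Only if FI_PROVIDER=opx was set and PPN > 96 is really required,
        # switch back to the default PSM2 provider first:
        # export FI_PROVIDER=psm2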

    Action items for users

    All users of Lise are recommended to

    • log in to an already migrated login node (see the current state table), for example to blogin7.nhr.zib.de (fully qualified domain name for external ssh connections to Lise) or simply blogin7 (from within Lise)

    • get familiar with the new environment (check module avail)

    • check self-compiled software for continued operability

    • relink/recompile software as needed

    • adapt and test job scripts and workflows

    • submit test jobs to the new "cpu-clx:test" SLURM partition (see the example session after this list)

    • read the Q&A section and ask for support in case of further questions, problems, or software requests (support@nhr.zib.de)

    Questions and answers

    Why is an environment module “mycode/1.2.3” that I used under CentOS 7 no longer available?

    There can be several reasons. Maybe an environment module is not needed any longer because our installation of Rocky Linux 9 already includes the “mycode” package in a version newer than “1.2.3”. Or maybe we provide an updated environment module “mycode/2.0” instead; please check with module avail. Or maybe we have not yet considered continuing “mycode” under Rocky Linux 9; in this case please submit a support request.

    Has software that was installed under CentOS 7 been removed?

    No. Though environment modules prepared under CentOS 7 might not be available anymore, the actual software they were pointing at is still available under the “/sw” file system.

    Can I still use srun with PMI-2, for example for a binary linked against an older Intel-MPI?

    Yes. Simply unset I_MPI_PMI_LIBRARY and export SLURM_MPI_TYPE=pmi2 before invoking srun. (Be prepared that this causes problems for binaries linked against Open MPI 5.x.)
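
    As a minimal sketch of this workaround (the binary name is a placeholder):

        unset I_MPI_PMI_LIBRARY
        export SLURM_MPI_TYPE=pmi2
        srun ./mybinary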

    Why do I get a warning about I_MPI_PMI_LIBRARY when using mpirun?

    This is because we need to define I_MPI_PMI_LIBRARY to ensure that the PMIx interface of srun also works for Intel-MPI. If you prefer mpirun over srun, you can manually unset I_MPI_PMI_LIBRARY to avoid this (harmless) warning message.

    What happened to the Intel “classic” C/C++ compilers icc and icpc?

    Starting with the 2022.2 release of Intel’s oneAPI toolkits, the icc and icpc “classic” compilers (C/C++) have been marked as “deprecated” by Intel. Corresponding user warnings

    icc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.

    were generated when using icc or icpc. The 2024.x releases of the Intel oneAPI toolkits no longer contain icc and icpc. Users need to switch to Intel’s “next generation” icx and icpx compilers, respectively. They accept almost all of the “classic compiler” switches. More information is available in Intel’s porting guide.
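
    A minimal sketch of the switch (the optimisation and architecture flags are common examples, not a complete mapping of options):

        # formerly, under CentOS 7:  icc  -O2 -xCORE-AVX512 -o mybinary mycode.c
        # now, under Rocky Linux 9:
        icx  -O2 -xCORE-AVX512 -o mybinary mycode.c     # C
        icpx -O2 -xCORE-AVX512 -o mybinary mycode.cpp   # C++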

    I cannot log in any longer. Has my SSH key become invalid?

    No, your ssh key remains valid. We have seen this kind of problem for Windows users with an outdated version of PuTTY. Updating to a more recent PuTTY (≥ 0.81) solved the problem. The same holds for WinSCP, which also needs to be up to date.

    Why does a job submitted to one of the old CentOS 7 partitions not start?

    This behaviour is observed for jobs submitted to the old CentOS 7 partitions (see the table above). Please make sure you submit such jobs on blogin1 or blogin2, which currently still run CentOS 7, too. The generic node name “blogin” resolves to login nodes already running Rocky Linux 9; they should not be used for job submissions to the old CentOS 7 partitions.

    {"serverDuration": 9, "requestCorrelationId": "4f1267d552c64950a5514ca24277a66c"}