Hostnames: grace[1-2]
CPU | 2x NVIDIA Grace CPU Superchip (144 cores total) |
RAM | 480 GB LPDDR5X-4800 MHz ECC |
Disk | 2 TB |
Interconnect | InfiniBand NDR link between grace1 and grace2 |
Hostnames: apass{1,2}
The two systems apass{1,2} are equipped with Intel Optane Memory components (first generation Apache Pass): Storage Class Memory modules (SCM/NVRAM) and SSDs. The main difference between the two hosts is the memory capacity.
The Optane memory of each system can be configured in one of two modes (three, if the Mixed/Hybrid mode is counted): Memory Mode or AppDirect Mode.
By default, apass1 is configured in Memory Mode and apass2 in AppDirect Mode. If you need a different configuration, contact Steffen Christgau. The mount points for the persistent memory are usually /mnt/pmemX, where X often matches the NUMA domain of the socket/processor the memory is attached to. To be sure, run lstopo from the hwloc environment module. Not every pmem device might be mounted or accessible while a system is in AppDirect mode, because other software (e.g. DAOS) may grab a device exclusively. Check the output of mount to find the mount points of /dev/pmemX.
The login message (message of the day) displays the mode in which the system is currently running. You can also check the CurrentVolatileMode property in the /var/run/optane/state file. As a further simple check, run free -h: if the total memory capacity is around or above 3 TB, the system is in Memory Mode. If /dev/pmem[01] exists, AppDirect (or Mixed/Hybrid) mode is in effect.
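These checks can be combined into a quick shell session on either node. This is only a sketch: it assumes the standard tools are in your PATH and that the state file is plain text that grep can search.
$ grep CurrentVolatileMode /var/run/optane/state   # mode as recorded by the system
$ free -h                                          # total around/above 3 TB => Memory Mode
$ ls /dev/pmem* 2>/dev/null                        # device nodes present => AppDirect (or Mixed/Hybrid) mode
$ mount | grep pmem                                # mount points of the persistent memory, e.g. /mnt/pmemX
$ module load hwloc && lstopo                      # NUMA domain each pmem device is attached to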
CPU | 2x Intel Xeon Platinum 8260L (24c, 2.4 GHz) Cascade Lake SP |
System | Inspur NF5280M5 |
Memory | apass1:
apass2:
All DIMM slots fully populated with Optane/DRAM pairs (2:2:2 configuration). The Optane DIMMs are interleaved and a single region spans over them (per socket) |
Storage | apass1:
apass2:
|
Network | Single Port Omni-Path HFI Adapter 100 Series (back-to-back connected via Cu cable) |
Pic 1: Server Board Layout
Hostname: aurora
CPU | 2x Intel Xeon Gold 6126 (12c, 2.6 GHz) Skylake |
Memory | 192 GB (DDR4-2666 ESS RDIMM) |
Accelerators | 8x NEC Vector Engine 1.0 (VE) Model B |
VE Configuration | per VE:
|
Network | 2x 100 Gb/s IB between the two PCI root complexes |
VE OS: 2.4.3
Pic 2: Aurora Server with 8 VEs
Hostname: cpl
Cooper Lake is Intel's codename for the third generation of its Xeon Scalable processors, developed as the successor to Cascade Lake.
Improvements:
CPU | 4x Intel Xeon Platinum 8353H (18c, 2.5 GHz) Cooper Lake |
Memory | 384 GB (DDR4-3200 RDIMM) |
Storage | 18 TB NVMe RAID local scratch (/local) |
Network | 2x 10 Gb/s Ethernet |
Hostname: icl
CPU | 2x Intel Xeon Platinum 8360Y (36c, 2.4 GHz) Ice Lake |
Memory | 512 GB (DDR4-3200 RDIMM) |
Storage | 18 TB NVMe RAID local scratch (/local) |
Network | 2x 10 Gb/s Ethernet |
Pic 3: Server Board Layout
Hostname: maverick{1,2}
One dataflow engine hardware accelerator card per server (accelerator cards provided by ParTec).
CPU | 2x AMD EPYC 7513 (32c, 2.6 GHz) |
Memory | 256 GB (DDR4-3200 RDIMM) |
Storage | 1.6 TB NVMe SSD |
Network | 1 Gb/s Ethernet |
Pic 4: Server Board Layout
To gain access to the Next-Generation Technology Pool, contact support@nhr.zib.de. Please give a short description of your intention and the system you intend to use.
Use Slurm on the NGT login node to access individual NGT systems.
The NGT login node is "login-ngt", reachable using ssh via our public login nodes "blogin.nhr.zib.de" (replace USERNAME with your NHR@ZIB account name):
$ ssh -J USERNAME@blogin.nhr.zib.de USERNAME@login-ngt
Make use of the ssh-agent to avoid repeated prompts for the passphrase (ALL keys used to access NHR@ZIB systems must have a passphrase). Run ssh-agent to start the agent and ssh-add to load your default key. Or, if your ssh key is in ~/.ssh/id_rsa_nhr, run:
$ ssh-add ~/.ssh/id_rsa_nhr
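If no agent is running in your shell yet, a common way to start one and load the key is shown below; this is a generic sketch, not specific to the NHR@ZIB setup:
$ eval "$(ssh-agent -s)"
$ ssh-add ~/.ssh/id_rsa_nhr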
With a suitable ssh config, you can jump to the NGT login node using one simple command:
$ ssh login-ngt
The ssh config in ~/.ssh/config looks like this (replace USERNAME with your NHR@ZIB account name):
Host login-ngt
    ProxyJump %r@blogin.nhr.zib.de
    Hostname login-ngt
    User USERNAME
    IdentityFile ~/.ssh/id_rsa_nhr
Unused compute nodes are shut down. Slurm starts them on demand; depending on the node, this takes 2 to 5 minutes.
Use sinfo to query the node status. In the following example, "icl" is up and running, while "cpl" is powered down to save energy (indicated by the "~" suffix on its state):
login$ sinfo -N
NODELIST   NODES PARTITION STATE
icl            1       icl idle
cpl            1       cpl idle~
To start an interactive session on a compute node, use srun.
login$ srun --pty -picl bash -ls
icl$
Alternatively, you can use "salloc" to start and allocate a node:
login$ salloc -picl
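salloc only creates the allocation; a shell on the allocated node can then be obtained by running srun inside that allocation. This is a sketch of the usual Slurm pattern and may differ depending on the NGT Slurm configuration:
login$ salloc -picl
login$ srun --pty bash -ls
icl$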
When a node is up, direct ssh access is still possible, but it requires login-ngt as a jump host. An example ssh config (for node "cpl") is:
Host cpl-ngt
    ProxyJump %r@blogin.nhr.zib.de,%r@login-ngt
    Hostname cpl.ngt.nhr.zib.de
    User USERNAME
    IdentityFile ~/.ssh/id_rsa_nhr
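With this entry in place, the node can be reached in a single step (provided it is running or Slurm has already started it):
$ ssh cpl-ngt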