Metadata Usage on WORK

WORK is a Lustre filesystem with metadata servers and object storage servers managing the user data. WORK is a single, shared resource, so the I/O patterns of some user jobs may degrade the performance of other jobs running concurrently on WORK. Our goal is that this resource is used fairly across the wide spectrum of applications and users.

Jobs that issue hundreds of thousands of metadata operations (such as open, close, and stat) can make the filesystem "slow" or unresponsive, even though the metadata are stored on SSDs in our Lustre configuration.

Therefore, we provide some general advice to help avoid critical conditions on WORK, including unresponsiveness:

  • Write intermediate results and checkpoints as seldom as possible.
  • Try to write/read larger data volumes (>1 MiB) and reduce the number of files concurrently managed in WORK.
  • For inter-process communication use proper protocols (e.g. MPI) instead of files in WORK.
  • If you want to control your jobs externally, consider using POSIX signals instead of files that are frequently opened/read/closed by your program. You can send signals to batch jobs e.g. via "scancel --signal..."
  • Use MPI-IO to coordinate your I/O instead of each MPI task doing individual POSIX I/O (HDF5 and netCDF may help you with this).
  • Instead of a recursive chmod/chown/chgrp, please use a combination of lfs find and xargs, e.g. lfs find /path/to/folder|xargs chgrp $project, as this creates less stress on the metadata servers and is much faster.
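The signal-based job control mentioned above can be sketched in a few lines of bash. The SIGUSR1 convention below is hypothetical; use whatever signal fits your application:

```shell
#!/bin/bash
# Sketch of signal-driven job control (hypothetical convention:
# SIGUSR1 requests a checkpoint). In a batch job the signal would
# arrive via "scancel --signal=USR1 <jobid>" from outside, causing
# no file traffic on WORK.
checkpoint_requested=0
trap 'checkpoint_requested=1' USR1

# ... main work loop of the job script ...
kill -USR1 $$   # for demonstration only: signal ourselves

if [ "$checkpoint_requested" -eq 1 ]; then
    echo "checkpoint requested"
    # write the checkpoint here, then reset the flag
    checkpoint_requested=0
fi
```

Your application itself can install an equivalent handler, so no control file in WORK ever needs to be polled.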

Analysis of metadata

An existing application can be investigated with respect to its metadata usage. Consider an example job script for the parallel application myexample.bin with 16 MPI tasks.

Example job script
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --partition=standard96

srun ./myexample.bin

Adding the Linux command strace to the job creates two trace files per Linux process (MPI task). For this example, 32 trace files are created. Large MPI jobs can create a huge number of trace files; for example, a 128-node job with 128 x 96 MPI tasks created 24576 files. That is why we strongly recommend reducing the number of MPI tasks as far as possible for such an investigation.

Job script with strace
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --partition=standard96

srun strace -ff -t -o trace -e open,openat ./myexample.bin

Analysing one trace file shows all file open activity of one process (MPI task).

Trace file analysis
> ls -l trace.*
-rw-r----- 1 bzfbml bzfbml 21741 Mar 10 13:10 trace.445215
...
> wc -l trace.445215
258 trace.445215
> cat trace.445215
13:10:37 open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
13:10:37 open("/lib64/libfabric.so.1", O_RDONLY|O_CLOEXEC) = 3
...
13:10:38 open("/scratch/usr/bzfbml/mpiio_zxyblock.dat", O_RDWR) = 8

When interpreting the trace file, expect a number of open entries that originate from the Linux system (e.g. shared libraries) independently of your code. The example code myexample.bin creates only one file, named mpiio_zxyblock.dat. The 258 open statements in the trace file include only one open from the application itself, which indicates a very desirable level of metadata activity.
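For a quick overview across many trace files, you can count the open calls per file and isolate the ones that touch WORK. A sketch, assuming trace files named trace.* as produced above and WORK mounted under /scratch (adjust both to your setup):

```shell
# Count open/openat calls per trace file; high counts point to
# metadata-heavy ranks.
for f in trace.*; do
    [ -e "$f" ] || continue    # skip cleanly if no trace files exist
    printf '%8d %s\n' "$(grep -cE 'open(at)?\(' "$f")" "$f"
done | sort -rn | head

# Show only opens on WORK (here assumed under /scratch), with counts:
grep -hE 'open(at)?\(' trace.* 2>/dev/null | grep '/scratch' \
    | sort | uniq -c | sort -rn | head
```

Ranks with far more opens than their peers, or many repeated opens of the same WORK path, are the first candidates for restructuring the I/O.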

Known issues

For some codes we are aware of specific issues:

If you have questions or are unsure about your individual scenario, please contact your consultant.

Best practices for using WORK as a Lustre filesystem: https://www.nas.nasa.gov/hecc/support/kb/lustre-best-practices_226.html