WORK is a Lustre filesystem in which metadata servers and object storage servers manage the user data. It is a single, shared resource that uses a variety of I/O servers and hundreds of storage devices in parallel. As such, the I/O patterns of some user jobs may impact the performance of other jobs that use WORK concurrently. We want this resource to be used fairly across the wide spectrum of applications and users.

Jobs that each issue hundreds of thousands of metadata operations such as open, close and stat can make the filesystem "slow" (unresponsive), even though the metadata are stored on SSDs in our Lustre configuration.
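
As a rough illustration of how to keep the number of open and close operations low, the following Python sketch stores many small results in a single file instead of one file per result. The function names and the JSON Lines layout are assumptions made for this example and are not part of any WORK tooling.

```python
import json


def write_results(path, results):
    """Store all results as JSON Lines in one file (one open, one close)."""
    with open(path, "w", encoding="utf-8") as f:
        for result in results:
            f.write(json.dumps(result) + "\n")


def read_results(path):
    """Read every result back with a single open/close pair."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

With one file per result, a job producing 100 000 results would issue at least 100 000 open and 100 000 close calls. Writing them as sketched here needs one of each.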

We therefore provide some general advice for avoiding critical conditions on WORK, including unresponsiveness:

  • Write intermediate results and checkpoints as seldom as possible.
  • Use large I/O sizes: write/read larger data volumes (>1 MiB) per operation and arrange your I/O as sequentially as possible, because WORK is hard-disk based.
  • Reduce the number of files concurrently managed in WORK.
  • For inter-process communication use proper protocols (e.g. MPI) instead of files in WORK.
  • If you want to control your jobs externally, consider using POSIX signals instead of files that your program frequently opens, reads and closes. You can send signals, e.g. to batch jobs, via "scancel --signal...".
  • Use MPI-IO to coordinate your I/O instead of each MPI task doing individual POSIX I/O (HDF5 and netCDF may help you with this).

For some of the codes we are aware of certain issues:

  • OPENFOAM
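
The general advice above on large I/O sizes could be followed, for example, by collecting small records in memory and writing them out in chunks of at least 1 MiB. This is a minimal Python sketch. The function name and the 4 MiB default chunk size are assumptions for illustration, and a suitable size depends on your application.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB; hypothetical default, tune for your job


def write_in_large_chunks(path, records, chunk_size=CHUNK_SIZE):
    """Write many small byte records sequentially using few large writes.

    Returns the number of large write calls issued by this function.
    """
    buffer = bytearray()
    writes = 0
    with open(path, "wb") as f:
        for record in records:
            buffer += record
            # Hand data to the file only once a full chunk has accumulated.
            if len(buffer) >= chunk_size:
                f.write(buffer)
                writes += 1
                buffer.clear()
        # Write whatever is left after the last full chunk.
        if buffer:
            f.write(buffer)
            writes += 1
    return writes
```

The records are appended in order, so the resulting access pattern is sequential, and the file is opened and closed only once.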
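
The suggestion above to control jobs with POSIX signals instead of frequently polled files might look as follows in Python. This is a sketch under assumptions: the choice of SIGUSR1, the class and function names, and the callbacks are hypothetical, and a Unix-like system is required. The handler only sets a flag, and the main loop acts on it at a safe point between steps.

```python
import signal


class SignalFlag:
    """Remembers that a signal arrived until the program consumes it."""

    def __init__(self, signum):
        self._pending = False
        # Must be called from the main thread of the program.
        signal.signal(signum, self._handler)

    def _handler(self, signum, frame):
        # Keep the handler minimal: just record the request.
        self._pending = True

    def consume(self):
        """Return True once per received request, then reset the flag."""
        pending, self._pending = self._pending, False
        return pending


def run(steps, do_step, write_checkpoint):
    """Run the steps; write a checkpoint only when SIGUSR1 was received."""
    flag = SignalFlag(signal.SIGUSR1)
    for step in range(steps):
        do_step(step)
        if flag.consume():
            write_checkpoint(step)
```

With Slurm, such a signal could be sent with something like "scancel --signal=USR1 <jobid>". Depending on how the program is launched, the "--batch" option may be needed so that the signal reaches the batch script and its child processes; please check the scancel manual page for your installation.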

If you have questions or are unsure about your individual scenario, please contact your consultant.