

    Unspecific error messages when reading huge input files

    May 06, 2021

    Problem

    In a job that requires "staging" of new, huge input files (8 GB in 650 files) during runtime, the job fails with error messages like "invalid file format". Inspecting the files afterwards does not reveal any errors; the input files are sane:

    cp repository/* input_area    # stage the input files on the Lustre file system
    mpirun ...                    # the parallel job starts reading immediately

    This appears to be a Lustre cache coherency problem: the parallel job starts up faster than Lustre can synchronise the freshly copied files across all nodes.

    Solution

    Add some delay after copying large file sets:

    cp repository/* input_area
    sleep 20                      # give Lustre time to synchronise the new files
    mpirun ...
    sleep 20                      # settle again before later steps read the job's output
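
    A fixed 20-second pause is only a guess at how long Lustre needs. A slightly more defensive variant (a sketch; the file-count check and the 60-second timeout are illustrative assumptions, not site policy) polls until all staged files are visible before launching:

    cp repository/* input_area
    # Wait (up to 60 s) until input_area lists as many files as repository.
    expected=$(ls repository | wc -l)
    for i in $(seq 1 60); do
        [ "$(ls input_area | wc -l)" -ge "$expected" ] && break
        sleep 1
    done
    mpirun ...

    Note that the listing is taken on the launching node only, so this does not guarantee that every compute node's cache is coherent; it merely replaces a blind wait with a bounded check.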

    Alternatively, the nocache tool works around this issue (thanks, John):

    nocache cp repository/* input_area   # copy without retaining pages in the page cache
    mpirun ...
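
    For background: nocache is an LD_PRELOAD wrapper that issues posix_fadvise(POSIX_FADV_DONTNEED) hints around file accesses, so the copied data is not retained in the node's page cache; presumably subsequent reads then fetch the files from the Lustre servers rather than from a possibly stale client-side cache.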

    Related articles

    • Page: Metadata Usage on WORK
    • Page: Unspecific error messages when reading huge input files
    • Page: Multiple programs multiple data




    Labels: huge, files, invalid, file, format, kb-troubleshooting-article