My job failed with something like "No space left on device / Input-output error"
Typically this happens when:
- your job generates large output (.OU) or error (.ER) files in the /var/spool directory, or
- your job generates large temporary file(s) in the /tmp directory.
Both the /var/spool and /tmp directories are protected by a filesystem quota. Your further jobs on the affected machine will not run until you remove the offending files.
How to remove the files:
- log in to the affected machine, e.g.
$ ssh user_123@node_123.metacentrum.cz
- list the files in your filesystem quota:
$ check-local-quota
- inspect the files; if they contain valuable data, copy them to your home directory. After that remove them.
- check local quota again; there should be no files left
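When inspecting the files, it can help to see which of them take the most space. The following is only a sketch: list_my_files is a hypothetical helper built from standard find, du and sort, not a MetaCentrum tool, and check-local-quota remains the command to rely on for the quota itself.

```shell
# Hypothetical helper (not part of the cluster tooling):
# list regular files owned by the current user under a directory, largest first.
# Output format: size in KiB, a tab, then the path.
list_my_files() {
    dir="$1"
    find "$dir" -xdev -type f -user "$(id -un)" -exec du -k {} + 2>/dev/null | sort -rn
}

# Usage on the affected node, e.g.:
#   list_my_files /tmp | head -n 10
```

Files you cannot read are silently skipped, so the list may be shorter than what check-local-quota reports.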
How to prevent the situation:
If the files were placed in the /tmp directory, add
export TMPDIR=$SCRATCHDIR
to the beginning of your batch script. Some applications use the TMPDIR variable to decide where to store temporary files. If TMPDIR is not defined, the files are stored in the system /tmp directory.
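A minimal sketch of how a program that honours TMPDIR ends up writing into the scratch directory. The fallback value for SCRATCHDIR and the file name template example.XXXXXX are assumptions made only so the snippet can be tried outside a job; inside a job SCRATCHDIR is expected to be set already.

```shell
# Assumption: outside a batch job SCRATCHDIR is unset, so fall back to a
# fresh temporary directory just to keep this sketch runnable.
: "${SCRATCHDIR:=$(mktemp -d)}"

# Point temporary files at the scratch directory instead of the system /tmp.
export TMPDIR="$SCRATCHDIR"

# A well-behaved program builds its temporary paths from TMPDIR and
# falls back to /tmp only when the variable is not defined.
tmpfile=$(mktemp "${TMPDIR:-/tmp}/example.XXXXXX")
echo "temporary file created at: $tmpfile"
rm -f "$tmpfile"
```

Applications that hard-code /tmp ignore TMPDIR, so it is worth checking the quota once after the first run.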
If the problem was caused by large .OU or .ER files, redirect them either to /dev/null or to a file in your scratch directory, e.g.
./your_application ..(options, input files etc)... 2> /dev/null # redirect .ER to /dev/null
./your_application ..(options, input files etc)... > $SCRATCHDIR/stdout # redirect .OU to a file in the scratch directory (the file name "stdout" is only an example)
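The two redirections can also be combined in one command. The sketch below uses a stand-in shell function in place of a real application, and the file names app.out and app.err are illustrative assumptions, as is the fallback value for SCRATCHDIR.

```shell
# Assumption: fall back to a fresh directory when run outside a batch job.
: "${SCRATCHDIR:=$(mktemp -d)}"

# Stand-in for your application: one line to stdout, one line to stderr.
your_application() {
    echo "result line"
    echo "diagnostic line" >&2
}

# Standard output goes to a file in the scratch directory,
# standard error is discarded.
your_application > "$SCRATCHDIR/app.out" 2> /dev/null

# Alternatively keep both streams, each in its own scratch file.
your_application > "$SCRATCHDIR/app.out" 2> "$SCRATCHDIR/app.err"
```

Discarding the error stream hides diagnostics as well, so the second form is usually the safer choice while a job is still being debugged.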