My job failed with something like "No space left on device / Input-output error"
Typically this happens when:
- your job generates large output (.OU) or error (.ER) files in the /var/spool directory, or
- your job generates large temporary file(s) in the /tmp directory.
Both the /var/spool and /tmp directories are protected by a filesystem quota. Until you remove the offending files, your further jobs will not run on the affected node.
How to remove the files:
- log in to the affected machine, e.g.
$ ssh user_123@node_123.metacentrum.cz
- list the files counted against your filesystem quota:
$ check-local-quota
- inspect the files; if they contain valuable data, copy them to your home directory. After that, remove them.
- check the local quota again; there should be no files left
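If you want to see which of your files are the largest before deciding what to copy and what to delete, a small helper along the following lines may be useful. It is only a sketch built on the standard find, du and sort tools; it is not part of check-local-quota, and the function name list_my_files is an assumption made for this example.

```shell
# Hypothetical helper (not a MetaCentrum tool): list regular files owned by
# the current user under a directory, largest first, with sizes in kilobytes.
list_my_files() {
    # usage: list_my_files <directory>
    # Errors from unreadable subdirectories are discarded.
    find "$1" -type f -user "$(id -un)" -exec du -k {} + 2>/dev/null | sort -rn
}
```

On the affected node you could then run, for instance, list_my_files /tmp or list_my_files /var/spool and inspect the entries at the top of the list first.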
How to prevent the situation:
If the files were placed in the /tmp directory, add
export TMPDIR=$SCRATCHDIR
to the beginning of your batch script. Some applications use the TMPDIR variable to decide where to store temporary files. If TMPDIR is not defined, the files are stored in the system /tmp directory.
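A slightly more defensive variant of this setting could check that the scratch directory really exists before pointing TMPDIR at it. The sketch below assumes that SCRATCHDIR is set by the batch system inside the job; the function name use_scratch_tmp is hypothetical and chosen only for this example.

```shell
# Hypothetical helper: point TMPDIR at the job's scratch directory,
# but only if SCRATCHDIR is set and is an existing directory.
use_scratch_tmp() {
    if [ -z "${SCRATCHDIR:-}" ] || [ ! -d "$SCRATCHDIR" ]; then
        echo "SCRATCHDIR is not set or is not a directory" >&2
        return 1
    fi
    export TMPDIR="$SCRATCHDIR"
}
```

Near the top of the batch script you would call use_scratch_tmp || exit 1, so that the job stops early instead of silently falling back to the system /tmp directory. Tools that honour TMPDIR, such as mktemp, will then create their files in the scratch directory.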
If the problem was caused by large .OU or .ER files, redirect them either to /dev/null or to a file in your scratch directory, e.g.
./your_application ..(options, input files etc)... 2> /dev/null # redirect .ER to /dev/null
./your_application ..(options, input files etc)... > $SCRATCHDIR/ # redirect .OU to scratch directory
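Both redirections can also be combined in a single command. The following is a minimal sketch: the function name run_redirected and the output file name used in the usage note are assumptions for illustration, not names taken from the batch system.

```shell
# Hypothetical wrapper: run a command with standard output written to a
# chosen file and standard error discarded.
run_redirected() {
    # usage: run_redirected <output_file> <command> [arguments...]
    rr_out_file=$1
    shift
    "$@" > "$rr_out_file" 2> /dev/null
}
```

In a batch script this could be used as run_redirected "$SCRATCHDIR/stdout.log" ./your_application followed by the application's options, where stdout.log is an example file name. Keep in mind that discarding standard error also hides error messages that might help when debugging a failed job.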
