Quotas

Keeping large volumes of data or too many files in users' home directories is problematic, since it significantly increases the time needed to back up the home directories, as well as to manipulate them for any other purpose. To keep system services sustainable, most storages have both a quota on the number of files and a quota on the total volume of data.

You can see the state of your quotas:

in the table that appears every time you log in to a frontend,
in your quota overview in PBSmon.

Find your large data

`ncdu2` tool

The ncdu2 command is especially suitable for locating the files that occupy most of your quota.

ncdu2 walks through the directory structure, collects the data and sorts them by size or by number of files. It is therefore suitable for finding both large files and directories with a huge number of (possibly small) files.

ncdu2 is installed on all MetaCentrum nodes.

Get the database

Basic usage

The ncdu2 tool stores the collected information in a .json file. For example, the command

ncdu2 -x -o output.json /storage/cityXY/home/user_123/

will create an output.json file with information about the content of the home directory of user user_123 on storage cityXY.

Probe from storage, not from a frontend

Since probing the directory structure of /storage from a frontend is inefficient, it is faster to run ncdu2 directly on the storage server where the directory of interest is located:

ssh storage-cityXY.metacentrum.cz 'ncdu2 -x -o output.json /storage/cityXY/home/user_123/'

If the .json file is large, gzip it up

If you suspect the resulting .json file will be large (typical for an overflow of the number-of-files quota), consider compressing it with gzip:

ssh storage-cityXY.metacentrum.cz 'ncdu2 -x -o - /storage/cityXY/home/user_123/' | gzip > /storage/cityXY/home/user_123/output.json.gz

Wrap it in a job

If the process is likely to take longer, wrap the command in a batch job, e.g.:

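#!/bin/bash
#PBS -N ncdu_test
#PBS -l select=1:ncpus=1:mem=4gb:scratch_local=10gb
#PBS -l walltime=2:00:00

# directory on storage where the result will be kept
RESDIR="/storage/brno2/home/user123/ncdu2_result"
mkdir -p "$RESDIR"

cd "$SCRATCHDIR" || exit 1

# run ncdu2 on the storage server; "-o -" writes the database to standard
# output, which is then compressed on the compute node
ssh storage-brno2.metacentrum.cz 'ncdu2 -x -o - ~/' | gzip > files_brno2.json.gz

# copy the result from scratch to storage
cp files_brno2.json.gz "$RESDIR/"

clean_scratch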

Display the database

Once you have the result file, open the ncdu2 pseudo-graphical interface with:

zcat files_brno2.json.gz | ncdu2 -f -   # for a .gz file
ncdu2 -f files_brno2.json              # for an uncompressed .json file

You will see something like the following.

Press ? to display the help.

List by size

If you need to locate where most of the volume resides, press s to sort by size.

List by number of files

If you need to locate directories with a large number of files, press c to display the item counts, then C to sort by them.

When you exceed a quota

Delete it

If you have produced a large amount of data by mistake, remove it either with a single command, e.g.

(BUSTER)user123@tarkil:~$ ssh user123@storage-brno6.metacentrum.cz 'rm -rf ~/junk_dir'

or wrap the command in a batch job so that you do not have to wait for it to finish:

(BUSTER)user123@tarkil:~$ qsub -l walltime=24:00:00 remove_junk_dir.sh
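The content of remove_junk_dir.sh is not given on this page. A minimal sketch could look as follows; the path and login are placeholders, the helper name remove_dir is hypothetical, and it assumes the storage is reachable under /storage from the compute node, as the paths used on this page suggest. Deleting through that mount may be slower than running rm on the storage server, but the job does it without you waiting.

```shell
#!/bin/bash
#PBS -N remove_junk_dir
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -l walltime=24:00:00

# Hypothetical path of the directory to delete; adjust storage and login
JUNK_DIR="/storage/brno6/home/user123/junk_dir"

# Delete a directory tree, refusing empty paths and non-directories
# so that a typo in JUNK_DIR cannot turn into "rm -rf" on something else
remove_dir() {
    local dir="$1"
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "Not a directory: '$dir'" >&2
        return 1
    fi
    rm -rf -- "$dir"
}

if remove_dir "$JUNK_DIR"; then
    echo "Removed $JUNK_DIR"
fi
```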

Pack small files into large chunks

If the data are not junk, pack them into larger chunks with the tar command, either from the command line or from within a batch job:

(BUSTER)user123@tarkil:~$ ssh user123@storage-brno6.metacentrum.cz 'tar -cf ~/not_junk_dir.tar ~/not_junk_dir'
(BUSTER)user123@tarkil:~$ qsub -l walltime=24:00:00 tar_my_files.sh
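The content of tar_my_files.sh is not given on this page either. One possible sketch is below; the path is a placeholder, the helper pack_dir is hypothetical, and it again assumes the storage is mounted under /storage on the compute node. Note that the number-of-files quota is only freed once the original files are deleted, so the sketch reads the archive back before removing them.

```shell
#!/bin/bash
#PBS -N tar_my_files
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -l walltime=24:00:00

# Hypothetical directory with many small files; adjust storage and login
SRC_DIR="/storage/brno6/home/user123/not_junk_dir"

# Pack a directory into <dir>.tar next to it, check the archive,
# then delete the originals so that the file-count quota is freed
pack_dir() {
    local src="${1%/}"                  # strip a trailing slash
    local parent name archive
    [ -d "$src" ] || { echo "Not a directory: '$src'" >&2; return 1; }
    parent=$(dirname -- "$src")
    name=$(basename -- "$src")
    archive="$parent/$name.tar"
    # -C stores relative paths (name/...) instead of absolute ones
    tar -cf "$archive" -C "$parent" "$name" || return 1
    # listing the whole archive is a basic readability check
    tar -tf "$archive" > /dev/null || return 1
    rm -rf -- "$src"
}

if pack_dir "$SRC_DIR"; then
    echo "Packed $SRC_DIR into $SRC_DIR.tar"
fi
```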

If you have enough space in your storage directories, you can keep the packed data there. However, we encourage users to archive any data of permanent value from finished projects.

If, for some reason, you need some of your quotas adjusted, contact us.

Archive the data

For both operational reasons (regular backups of storages) and safety reasons (storages have a weaker backup policy than archives), users should archive any data that are of permanent value to them and may be needed in the future.

Archiving data from finished projects also helps to avoid problems with storage quotas.

Move the data to another storage

This is an intermediate solution. Quotas on individual storages are separate, so you can temporarily move some data to a different storage where you have more free space.

Root filesystem quota

What is root filesystem quota

Apart from the quotas set on storages, there is a separate quota for users' data outside the home directory.

This applies when one of your processes writes to the /tmp directory and, on computational nodes, when your job produces large standard output (.OU) or error (.ER) files in the /var/spool directory.

Root filesystem quota is only 1 GB

The root filesystem quota is relatively small. If it is exceeded, the user receives an email with instructions on what to do. Until the data are deleted, no further calculations will be run on the computational node.

How to clear files filling the quota

1. Log in to the affected machine:

ssh user123@halmir18.metacentrum.cz

2. List the files counted in your root filesystem quota using the check-local-quota tool:

check-local-quota

3. Inspect the files; if they contain valuable data, copy them to your home directory. Then remove them.
4. Run check-local-quota again; no files should be left.
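To get a quick overview of what you have in /tmp, for example alongside the check-local-quota output, a sketch based on GNU find could look like this. The function name list_my_files is hypothetical; it prints the regular files owned by you, largest first, and accepts another directory as its argument.

```shell
#!/bin/bash
# List regular files owned by the current user under a directory
# (default /tmp), largest first, as "size_in_bytes<TAB>path"
list_my_files() {
    local dir="${1:-/tmp}"
    # -xdev: do not descend into other filesystems mounted below $dir
    # 2>/dev/null: hide "Permission denied" messages for other users' directories
    find "$dir" -xdev -user "$(id -un)" -type f -printf '%s\t%p\n' 2>/dev/null \
        | sort -rn
}

# show the 20 largest of your files in /tmp
list_my_files /tmp | head -n 20
```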

How to prevent the situation

Redirect TMPDIR to SCRATCHDIR

TMPDIR is the common name of the variable that holds the directory where temporary files should be kept.

Some software uses the /tmp directory as the default location for temporary files. Try adding

export TMPDIR=$SCRATCHDIR

to the beginning of your batch script. Applications that honour TMPDIR will then place their temporary files in the scratch directory instead.
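A minimal batch script header with this setting might look as follows. The job name and resources are placeholders, and the fallback to the existing TMPDIR or /tmp is an addition so that the script also works when tried outside a job, where SCRATCHDIR is unset. mktemp is used only to show where temporary files now end up; GNU sort and Python's tempfile module are examples of other tools that read TMPDIR.

```shell
#!/bin/bash
#PBS -N tmpdir_example
#PBS -l select=1:ncpus=1:mem=4gb:scratch_local=10gb
#PBS -l walltime=2:00:00

# Send temporary files to the job's scratch directory instead of /tmp.
# Outside a batch job SCRATCHDIR is unset, so fall back to TMPDIR or /tmp.
export TMPDIR="${SCRATCHDIR:-${TMPDIR:-/tmp}}"

# mktemp honours TMPDIR; use it to check where temporary files now go
check_file=$(mktemp)
echo "Temporary files will be created in: $(dirname -- "$check_file")"
rm -f -- "$check_file"

# ... run your application here ...
```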

Dump/redirect large outputs

If the problem was caused by large .OU or .ER files, either discard the output by redirecting it to /dev/null

./your_application ... > /dev/null # redirect .OU to /dev/null
./your_application ... 2> /dev/null # redirect .ER to /dev/null
./your_application ... > /dev/null 2>&1 # redirect both .OU and .ER to /dev/null

or redirect it to files in your scratch directory (the commands below write to the current directory, so run them after cd $SCRATCHDIR)

./your_application ... > standard_output.txt # redirect .OU to standard_output.txt
./your_application ... 2> error_output.txt # redirect .ER to error_output.txt
./your_application ... > std_err_output.txt 2>&1 # redirect both .OU and .ER to std_err_output.txt

If you redirect to /dev/null, the data are discarded and cannot be recovered later. Redirecting to files lets you inspect them after your calculation is done.
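Since scratch is cleaned at the end of a job, files written there have to be copied to storage before you can inspect them. A possible batch script is sketched below; RESDIR, the helper run_logged and ./your_application are placeholders, and the job part is skipped when SCRATCHDIR is unset (i.e. outside a batch job). If the logs are very large, keep in mind that copying them moves the volume to your storage quota.

```shell
#!/bin/bash
#PBS -N big_output_job
#PBS -l select=1:ncpus=1:mem=4gb:scratch_local=10gb
#PBS -l walltime=2:00:00

# Hypothetical directory on storage where the logs are kept after the job
RESDIR="/storage/brno2/home/user123/results"

# Run a command with its standard output and error written to files in
# the given directory instead of the job's .OU/.ER files
run_logged() {
    local outdir="$1"
    shift
    "$@" > "$outdir/standard_output.txt" 2> "$outdir/error_output.txt"
}

# SCRATCHDIR is only set inside a batch job; skip the job part otherwise
if [ -n "$SCRATCHDIR" ]; then
    cd "$SCRATCHDIR" || exit 1
    run_logged "$SCRATCHDIR" ./your_application   # placeholder for your program
    # scratch is cleaned at the end, so copy the logs to storage first
    mkdir -p "$RESDIR"
    cp standard_output.txt error_output.txt "$RESDIR/"
    clean_scratch
fi
```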

The causes mentioned above are the most common ones, but your filesystem quota can be exceeded in other ways too. If you are not sure what caused the problem or how to prevent it from happening again, feel free to contact us.