PBS resources
Chunk vs job
The PBS concept is based on so-called chunks as basic computational units. A chunk is a further indivisible set of resources placed on the same host. A user may request one (default) or more chunks for a job.
Some resources can be defined per chunk, i.e. the user can request them for each chunk separately. Examples of chunk-wide resources are the number of CPUs or GPUs, or the scratch directory type and volume.
Other resources can be defined only for the job as a whole. Examples of job-wide resources are walltime, queue type or software licences.
Chunks are typically used to optimize the resource usage of parallelized computations. For example, a driving process may be placed on one chunk with relatively few CPUs, combined with parallelized parts of the job running on other chunk(s) with many CPUs.
Chunks can be placed next to each other on one machine or always on different machines, depending on the job-wide option place.
The number of chunks N is set with -l select=[N]:. When a user needs to set up more chunks with different resources, they are concatenated with +. The select keyword can be used only once. N is not mandatory in the select statement; the default value of N is 1.
Examples of chunk specification:
-l select=:ncpus=2 # one chunk with two CPUs
-l select=1:ncpus=2 # same as above
-l select=1:ncpus=1+1:ncpus=2 # two chunks, one with one CPU and one with two CPUs
-l select=6:ncpus=2:mem=4gb+3:ncpus=8:mem=4gb # six chunks with 2 CPUs and 4 GB of memory each, and three chunks with 8 CPUs and 4 GB of memory each
Examples of place usage:
-l select=2:ncpus=1 -l place=pack # two chunks, both must be on one host
-l select=2:ncpus=1 -l place=scatter # two chunks which must be on different hosts
-l select=2:ncpus=1 -l place=free # two chunks placed arbitrarily according to resource availability (default)
If you are not sure about the number of needed processors, ask for an exclusive reservation of the whole machine:
-l select=2:ncpus=1 -l place=exclhost # request the whole host to be allocated to this job (without CPU and memory limit control)
-l select=102:place=group=cluster # 102 CPUs on one cluster
Chunk-wide resources
CPUs
Resource name: ncpus. Default value: 1.
Example:
-l select=1:ncpus=2 # request 2 CPUs
Memory
Resource name: mem. Default value: 400 MB.
Example:
-l select=1:ncpus=1:mem=10gb # request a job with 10 GB of memory
CPU type
Resource name: cpu_vendor.
In the Metacentrum grid there are machines with both AMD and Intel processors. Some software may be sensitive to the processor used (although most applications run seamlessly on both). Therefore you can request a specific CPU vendor:
qsub -l select=1:ncpus=1:cpu_vendor=amd # use machine with AMD processor
qsub -l select=1:ncpus=1:cpu_vendor=intel # the same with an Intel processor
CPU speed
Resource name: spec.
CPUs across the Metacentrum grid differ in how fast they are. Therefore they are classed by the parameter spec according to the methodology of SPEC CPU2017. To see the spec values, go to the qsub assembler and see the drop-down menu for the spec parameter.
Todo
How to set the link so that the click-through goes to the EN version of the qsub assembler?
Example:
qsub -l select=:spec=4.8 # 1 CPU with speed class 4.8 or higher
Scratch directory
Resource names: scratch_local, scratch_shared, scratch_ssd, scratch_shm. Default value: none.
We offer four types of scratch storage:
scratch_local
- available on every node, located on regular hard disc
- choose this type as a default if you have no reason to do otherwise
- located in
/scratch/USERNAME/job_JOBID
scratch_ssd
- located on small SSD disc
- ultra fast (compared to local scratch), but smaller in volume
- not available on all computational nodes!
- to check for availability on a particular node: see Node properties on Metavo pages
- recommended in jobs where the bottleneck is disc-related operations (applications that create/read a lot of files)
- located in
/scratch.ssd/USERNAME/job_JOBID
scratch_shared
- network volume which is shared between all clusters in a given location (city)
- read/write operation slower than on local scratch
- useful if you need to run more than one application that needs access to the same data
- not available on all computational nodes!
- to check for availability on a particular node: see Node properties on Metavo pages
- mounted to directory
/scratch.shared/USERNAME/job_JOBID
scratch_shm
- scratch directory is in RAM
- fastest, but data on scratch do not survive the end/failure of the job
- use when you need ultra fast scratch AND when you absolutely don't care about data from failed/killed/ended jobs
- boolean type, submitted as
scratch_shm=true
- maximum size of scratch is defined by the mem (memory) parameter
- remember to choose mem large enough to hold both the data in scratch and the actual memory requirements of the job (see the example below this list)
- mounted to directory
/dev/shm/scratch.shm/USERNAME/job_JOBID
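A minimal sketch of a scratch_shm request (the sizes are only illustrative): since the RAM-based scratch counts against mem, a job that itself needs about 4 GB of memory and expects about 10 GB of scratch data should request mem of at least 14 GB:
-l select=1:ncpus=1:mem=14gb:scratch_shm=true # RAM-based scratch; mem must cover both the job's memory and the scratch data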
The scratch directory itself cannot be removed, only its content can. Therefore you cannot clean the scratch with rm -rf $SCRATCHDIR, but you can do it with rm -rf $SCRATCHDIR/*. Users should always clear the content of the scratch directory after the job ends to free disc space. Otherwise, the directory will be automatically deleted after approx. 14 days (earlier if disk space runs short).
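For example, a batch script can clean the scratch as its last step, after the results have been copied elsewhere; a minimal sketch, where results.out and /some/permanent/storage are only placeholders:
cp results.out /some/permanent/storage/ && rm -rf $SCRATCHDIR/* # copy the results out first, then remove the scratch content (the directory itself cannot be removed)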
The most important system variables related to scratch are these:
SCRATCHDIR: path to the scratch directory
SCRATCH_TYPE: type of the scratch directory
SCRATCH_VOLUME: size of the scratch directory
Example:
(BUSTER)user123@skirit:~$ qsub -I -l select=scratch_local=20gb -l walltime=1:00:00
qsub: waiting for job 12910867.pbs-m1.metacentrum.cz to start
qsub: job 12910867.pbs-m1.metacentrum.cz ready
(BULLSEYE)user123@elmo3-3:~$ echo $SCRATCHDIR
/scratch/user123/job_12910867.pbs-m1.metacentrum.cz
(BULLSEYE)user123@elmo3-3:~$ echo $SCRATCH_TYPE
local
(BULLSEYE)user123@elmo3-3:~$ echo $SCRATCH_VOLUME
21474836480
Number of GPUs
Resource name: ngpus. Default value: 0.
Todo
Add more here about GPU computations, which queue to submit them to, etc.
-l select=1:ncpus=1:ngpus=2 -q gpu # one chunk with 1 CPU and 2 GPUs, submitted to the gpu queue
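A fuller, purely illustrative submission that combines the GPU request with other resources described on this page (all values and the script name myjob.sh are placeholders):
qsub -q gpu -l select=1:ncpus=2:ngpus=1:mem=8gb:scratch_local=10gb -l walltime=4:00:00 myjob.sh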
OS
Resource names: os, osfamily.
To submit a job to a machine with a specific operating system, use os=OS_name:
zuphux$ qsub -l select=1:ncpus=2:mem=1gb:scratch_local=1gb:os=debian11 …
To submit a job to a machine with a specific OS type, use osfamily=OS_type_name:
zuphux$ qsub -l select=1:ncpus=2:mem=1gb:scratch_local=1gb:osfamily=debian …
Cluster
Resource names: cluster, cl_NAME. Default value: none.
PBS allows you to choose a particular cluster, using either the cluster resource or cl_NAME:
qsub -l select=1:ncpus=2:cluster=halmir # run the job on cluster "halmir"
qsub -l select=1:ncpus=2:cl_halmir=True # same as above
Alternatively, you can avoid a particular cluster:
qsub -l select=1:ncpus=2:cluster=^halmir
qsub -l select=1:ncpus=2:cl_halmir=False
However, it is not possible to combine conditions. If you e.g. want to avoid both adan and halmir, the following
qsub -l select=1:ncpus=2:cluster=^adan:cluster=^halmir
will not work. This is based on the principle that in PBS, every resource (in this case the cluster resource) can be specified only once.
On the other hand, cl_adan and cl_halmir are different resources, so:
qsub -l select=1:ncpus=2:cl_adan=False:cl_halmir=False
will work and will avoid both the adan and halmir clusters.
The same can be done with:
qsub -l select=1:ncpus=2:cluster=^adan:cl_halmir=False
Location
Resource names: brno, budejovice, liberec, olomouc, plzen, praha, pruhonice, vestec. Default value: none.
As the physical machines are distributed over multiple locations in the Czech Republic, it may be useful to be able to specify the location of the machine(s):
qsub -l select=1:ncpus=1:brno=True # run on machines located in Brno.
MPI processes
Resource names: mpiprocs, ompthreads.
The number of MPI processes that will run in one chunk is specified by mpiprocs=[number]:
-l select=3:ncpus=2:mpiprocs=2 # 6 MPI processes (the nodefile contains 6 lines with names of vnodes); 2 MPI processes always share 1 vnode with 2 CPUs
The number of OpenMP threads that will run in one chunk is set by ompthreads=[number]. The default behaviour is ompthreads = ncpus, i.e. a chunk with 2 CPUs gets 2 OpenMP threads.
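For instance, a purely OpenMP job could be requested as follows (a sketch with illustrative values):
-l select=1:ncpus=8:ompthreads=8 # one chunk with 8 CPUs running 8 OpenMP threads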
Job-wide resources
Duration
Resource name: walltime. Default value: 24:00:00 (24 hours).
The maximal duration of a job is set in the format hh:mm:ss.
Example:
-l walltime=1:00:00 # one hour job
Users can, to a certain extent, prolong the walltime of running jobs - see the qextend command.
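A sketch of such a prolongation, assuming qextend is called with the full job ID and the amount of additional walltime (both values below are only illustrative):
qextend 12345678.pbs-m1.metacentrum.cz 24:00:00 # ask for 24 more hours of walltime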
Queue and/or PBS server
If you need to send the job to a specific queue and/or a specific PBS server, use the qsub -q option.
qsub -q queue@server # specific queue on specific server
qsub -q queue # specific queue on the current (default) server
qsub -q @server # default queue on specific server
Licence
Some software requires a licence to run. A licence is requested with the -l parameter:
-l select=3:ncpus=1 -l walltime=1:00:00 -l matlab=1 # one licence for Matlab
Paths for output
By default, the job output (output and error files) is saved in the folder from which the job was submitted (variable PBS_O_WORKDIR).
This behaviour can be changed with the -o parameter for the output file and the -e parameter for the error file.
-o /custom-path/myOutputFile
-e /custom-path/myErrorFile
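Put together, a submission redirecting both files might look like this (the paths and the script name myjob.sh are only placeholders):
qsub -l select=1:ncpus=1 -o /custom-path/myOutputFile -e /custom-path/myErrorFile myjob.sh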