NVIDIA DGX H100 Usage
Technical Parameters
The NVIDIA DGX H100 system consists of a single compute node, capy.cerit-sc.cz, with the following specifications:
- 112 CPU cores
- 2 TB RAM
- 8× NVIDIA H100 GPU accelerators (each with 80 GB GPU memory)
Access Conditions
Users can apply for computational resources for a 3-month period through the DGX Grant Competition, which is open to employees of research, scientific, and educational organizations and to PhD students.
Applicants must demonstrate:
- The ability to use at least 2 GPUs, preferably 4 or all 8
- Effective use of NVLink (Multi-GPU) and large GPU memory
- Research potential and expected results
Approval Process
The application review consists of three steps:
- Submission of a PDF file containing the OpenAccess Application request to meta@cesnet.cz
- Internal review of the request
- Notification of competition results within 5 working days
OpenAccess Application
An OpenAccess Application request must contain the following sections:
- Popular Abstract
- Methods and State-of-the-Art
- Computational Approach, Parallelization, and Scalability
- Requested Computational Resources
The total length must not exceed 5 pages, including figures and tables. Submit the proposal as a PDF file to meta@cesnet.cz.
Popular Abstract
Provide a popular abstract that can be published on a website, in newspapers, or similar platforms. This abstract should describe:
- The proposed research
- The methodologies used
- The expected impact
The abstract should be written in a way that is easily understandable by the general public.
Methods and State-of-the-Art
- Clearly define the research aims and objectives.
- Provide sufficient detail for reviewers to understand the proposal.
- Specify if the research is part of an approved H2020, ERC, EuroHPC, or other peer-reviewed national or international projects.
- Describe the theoretical and computational methods to be used, comparing them with the current state-of-the-art.
- Outline expected results, including planned publications.
Computational Approach, Parallelization, and Scalability
- Describe the computational techniques and platforms to be used.
- Include details on codes, programming languages, libraries, and other software.
- Explain parallelization and scalability, especially in relation to NVLink, and show that your jobs can use 4 or 8 GPUs effectively at the same time.
- If possible, provide references and data on your application’s parallel performance, speedup, and scalability.
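Parallel performance is commonly reported as speedup (single-GPU runtime divided by N-GPU runtime) and parallel efficiency (speedup divided by N). The following is a minimal sketch of that arithmetic, not a required format for the application. The function name `scaling` and the runtimes (400 and 125 seconds) are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: prints "<speedup> <efficiency>" for a benchmark pair.
#   $1 = runtime on 1 GPU, $2 = runtime on N GPUs, $3 = N
scaling() {
    LC_ALL=C awk -v t1="$1" -v tn="$2" -v n="$3" \
        'BEGIN { s = t1 / tn; printf "%.2f %.2f\n", s, s / n }'
}

# Made-up measurements: 400 s on 1 GPU, 125 s on 4 GPUs.
result=$(scaling 400 125 4)
echo "speedup and efficiency on 4 GPUs: $result"
```

An efficiency close to 1 on 4 or 8 GPUs is the kind of evidence this section asks for.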
Requested Computational Resources
- Justify the requested computational resources.
- Estimate the required CPU hours, GPU hours, and RAM, both in total and for a typical job.
- Explain how the estimated resources were calculated.
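One simple way to derive such an estimate is to multiply the number of jobs by the GPUs per job and the walltime per job. The sketch below illustrates this under assumed numbers: 20 jobs, 4 GPUs each, 10 hours each. These values and the function name `gpu_hours` are hypothetical, and real estimates should come from your own benchmarks.

```shell
#!/bin/sh
# Hypothetical helper: GPU hours = jobs * GPUs per job * walltime (whole hours).
#   $1 = number of jobs, $2 = GPUs per job, $3 = walltime per job in hours
gpu_hours() {
    echo $(( $1 * $2 * $3 ))
}

# Assumed workload: 20 jobs, each using 4 GPUs for 10 hours.
per_job=$(gpu_hours 1 4 10)
total=$(gpu_hours 20 4 10)

echo "GPU hours for a typical job: $per_job"
echo "GPU hours in total: $total"
```

The same multiplication with the requested `ncpus` in place of the GPU count gives CPU hours.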
Usage
Once your request is approved, the DGX H100 system is accessible only through the gpu_dgx queue on the pbs-m1.metacentrum.cz server:
gpu_dgx@pbs-m1.metacentrum.cz
To submit a job to the gpu_dgx queue, use the following command:
qsub -q gpu_dgx@pbs-m1.metacentrum.cz -l select=1:ngpus=4 -l walltime=10:00:00
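The same resource request can also be kept inside a job script as `#PBS` directives. The sketch below writes such a script to a file; it is an illustration under assumptions, not an official template. The file name `dgx_job.sh`, the job name `dgx_example`, and the workload placeholder are hypothetical, and the script assumes that `nvidia-smi` is available on the compute node.

```shell
#!/bin/sh
# Write a hypothetical job script; the quoted 'EOF' keeps $PBS_O_WORKDIR
# unexpanded so that it is resolved later, when the job actually runs.
cat > dgx_job.sh <<'EOF'
#!/bin/bash
#PBS -N dgx_example
#PBS -q gpu_dgx@pbs-m1.metacentrum.cz
#PBS -l select=1:ngpus=4
#PBS -l walltime=10:00:00

# Start in the directory the job was submitted from.
cd "$PBS_O_WORKDIR" || exit 1

# List the GPUs assigned to this job.
nvidia-smi --query-gpu=index,name,memory.total --format=csv

# Print the GPU interconnect matrix; NVLink connections are shown as NV<n>.
nvidia-smi topo -m

# Placeholder: replace with your own multi-GPU workload.
echo "workload goes here"
EOF

echo "wrote dgx_job.sh"
```

With the directives in the script, submission reduces to `qsub dgx_job.sh`.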
Example Job Submission
If you require all 8 GPUs for 10 hours, request the entire node with the following qsub command:
qsub -q gpu_dgx -l select=1:ngpus=8:ncpus=112:mem=2000g:scratch.ssd=1tb -l place=exclhost -l walltime=10:00:00
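Before submitting, it can help to check that a request does not exceed what the single node offers. The following sketch assumes the limits stated in this document (8 GPUs, 112 CPU cores, and the 2000 GB memory value used in the whole-node request); the function name `check_request` is hypothetical, and the scheduler may enforce additional limits that are not modeled here.

```shell
#!/bin/sh
# Node limits taken from this document; treat them as assumptions.
MAX_GPUS=8
MAX_CPUS=112
MAX_MEM_GB=2000

# Hypothetical helper: returns 0 if the request fits on the node, 1 otherwise.
#   $1 = ngpus, $2 = ncpus, $3 = memory in GB
check_request() {
    if [ "$1" -lt 1 ] || [ "$1" -gt "$MAX_GPUS" ]; then
        echo "ngpus=$1 is outside 1..$MAX_GPUS" >&2
        return 1
    fi
    if [ "$2" -lt 1 ] || [ "$2" -gt "$MAX_CPUS" ]; then
        echo "ncpus=$2 is outside 1..$MAX_CPUS" >&2
        return 1
    fi
    if [ "$3" -lt 1 ] || [ "$3" -gt "$MAX_MEM_GB" ]; then
        echo "mem=${3}g is outside 1..${MAX_MEM_GB}g" >&2
        return 1
    fi
    return 0
}

# The whole-node request from the example above.
if check_request 8 112 2000; then
    echo "request fits on the node"
fi
```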
For additional support, contact meta@cesnet.cz.