Workload Manager - SLURM
The Noctua1 system uses the Simple Linux Utility for Resource Management (SLURM)
- plenty of documentation can be found on https://slurm.schedmd.com/documentation.html
- daily work commands are
- sbatch - submit a batch script to SLURM
- srun - run a parallel job
- scancel - signal or cancel job under the control of SLURM
- squeue - information about running jobs
- sinfo - info about the partitions and nodes
- scluster - info about currently allocated, free and offlined nodes
- all information about your program execution is contained in a batch script, which is submitted via sbatch
- the batch script contains one or more parallel job runs executed via srun (job steps).
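As a minimal sketch, a batch script with one job step could look as follows (the resource values are placeholders, not recommendations, and `hostname` stands in for a real application):

```shell
#!/bin/bash
#SBATCH -N 2                # number of nodes (placeholder value)
#SBATCH -t 00:30:00         # walltime limit (placeholder value)
#SBATCH -p batch            # partition

# Each srun call is one job step; hostname stands in for your application.
srun hostname
```

Such a script only does useful work when submitted via sbatch on the cluster; run directly, the `#SBATCH` lines are plain comments.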
SLURM Configuration on Noctua
- 29.05.2019: we changed the monthly windows to sliding windows of 30 days
- Slurm runs in private mode:
- squeue will only show your own jobs. Please use scluster if you are interested in the number of currently free, used, drained, or offline compute nodes.
- Accounting information (sacct) is only returned for your own jobs or for jobs of projects that you are a slurm coordinator of.
- Nodes are assigned exclusively to a user (no node sharing) except for fpgasyn (see below). That is, you can only request complete nodes (with two cpus and 20 cores per cpu, 40 cores total). Even if you use only one cpu-core of the nodes, your project is billed all 40 cpu-cores.
- Hyperthreading is disabled on the compute nodes, i.e. the 40 cpu-cores per node are physical cpu-cores without simultaneous multithreading.
- Requeuing, rebooting of nodes (except FPGA nodes), use of default accounts and submission to multiple partitions are not allowed.
- The accounting is based on allocated cpu-core hours. Allocating one node for one hour costs your project 40 cpu-core hours. FPGA-hours are presently not counted.
- Backfilling is enabled. That is, if your job has a lower priority than someone else's job, your job can still start early if it does not delay the start of the other person's job. In other words, your job can fill the gaps left by larger jobs. Hence, it is important to set a realistic time limit for your job.
- You can submit and start jobs even if you have exceeded your 30-day contingent, your 90-day contingent, or even your total contingent. The only consequence of exceeding these contingents is a reduced priority.
| partition name | max. walltime | default walltime | max. number of nodes | default number of nodes | nodes | purpose | partition_factor |
|---|---|---|---|---|---|---|---|
| all | 12 hours (12:00:00) | 4 hours (04:00:00) | 272 | 1 | cn-[0001-0256], fpga-[0001-0016] | system benchmarking | 1.0 |
| batch (default) | 12 hours (12:00:00) | 4 hours (04:00:00) | 128 | 1 | cn-[0001-0251] | normal jobs | 0.0 |
| long | 21 days (21-00:00:00) | 4 hours (04:00:00) | 64 | 1 | cn-[0001-0251] | long-running jobs | 0.0 |
| short | 30 min. (00:30:00) | 5 min. (00:05:00) | 2 | 1 | cn-[0252-0256], fpga-[0001-0016] | compile / test jobs | 0.0 |
| test | 30 min. (00:30:00) | 5 min. (00:05:00) | 64 | 1 | cn-[0001-0251] | sniff user | 0.0 |
| fpga | 2 hours (02:00:00) | 20 min. (00:20:00) | 16 | 1 | fpga-[0001-0016] | FPGA jobs | 1.0 |
| fpgasyn | 3 days (3-00:00:00) | 1 day (1-00:00:00) | 1 | 1 | cn-[0254-0256] | FPGA synthesis jobs, shared nodes (see below) | 1.0 |
You can list the partition you are allowed to use with pc2status. Slurm project coordinators can set additional limits for individual users in their project.
Since March 2020 the fpgasyn partition uses a different mode: to improve the throughput of FPGA synthesis jobs, the default is now to share a node between two jobs. Each job gets 91 GB of main memory by default but can use all 40 cpu-cores. If your synthesis needs less or more main memory, add the option --mem= to request a specific amount. For example,
sbatch --mem=60000MB -p fpgasyn test.sh
requests 60 GB of main memory, hence up to three jobs of this kind can run on one shared node in the fpgasyn partition. If you want to use a node exclusively (and all its main memory), please use
sbatch -p fpgasyn --exclusive test.sh
Additionally, for jobs in the fpgasyn partition the non-AVX single- and two-core turbo clock of 3.7 GHz is now available to speed up synthesis.
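As a rough illustration of the sharing arithmetic: the usable-memory figure below is an assumption derived from the two default 91 GB shares, not a documented value.

```shell
#!/bin/bash
# Sketch: how many fpgasyn jobs of a given --mem size fit on one shared node.
# node_mem_mb is an assumption (~2 x 91 GB); job_mem_mb is the --mem value
# from the example above.
node_mem_mb=186368
job_mem_mb=60000
echo $((node_mem_mb / job_mem_mb))
```

With 60 GB per job this yields three jobs per node, matching the text above; a larger --mem value reduces the count accordingly.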
The scheduling of jobs on Noctua is based on job priorities. The priority is calculated as:
Job_priority = 400000 * QoS_factor + 50000 * partition_factor + 35000 * age_factor + 15000 * job_size_factor
The QoS_factor is detailed below. The partition_factor is listed in the table above. The age_factor starts at zero when a job is submitted and grows linearly to reach a value of one after 10 days. The job_size_factor favors large jobs.
- The QoS_factor depends on the used and allocated cpu-core hours of your project in the last 30 days relative to a 30-day contingent. Allocated cpu-core hours are the sum of the future runtimes of your currently running jobs, assuming that they run until their time limit.
- There are two contingents:
- 30-day contingent M30 = (total contingent)/(days in project runtime) * 30
- 90-day contingent window M90 = M30 + (half of the leftover M30-contingent of a reference day 30 days ago) + (half of the M30-contingent of the next 30 days). The used contingent is not subtracted from your next M30-contingent.
- used and allocated cpu-hours lower than M30: QoS_factor=0.75 (normal)
- used and allocated cpu-hours larger than M30 but lower than M90: QoS_factor=0.5 (lowcont)
- used and allocated cpu-hours larger than M90: QoS_factor=0.25 (nocont)
- No contingent (M30=0, M90=0): QoS_factor=0 (suspended). Project is suspended and no jobs can be submitted.
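To make the formula concrete, here is a small sketch that evaluates it for hypothetical factor values (the inputs are invented examples, not real Slurm output):

```shell
#!/bin/bash
# Evaluate the priority formula from the text for example factor values.
qos_factor=0.75        # "normal" QoS level
partition_factor=0.0   # e.g. the batch partition
age_factor=0.5         # job has waited 5 of the 10 days
job_size_factor=0.1    # invented example value
priority=$(awk -v q="$qos_factor" -v p="$partition_factor" \
               -v a="$age_factor" -v s="$job_size_factor" \
    'BEGIN { printf "%d", 400000*q + 50000*p + 35000*a + 15000*s }')
echo "$priority"
```

For these values the QoS term dominates (300000 of 319000 points), which matches the weighting in the formula: staying within your contingent matters far more than job age or size.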
You can use pc2status to view the current contingents and QoS levels of your projects. sprio gives you the computed priorities of your pending jobs.
Jobs are submitted with the command line tool sbatch. The main command line argument of sbatch is a shell script that describes the parameters of the job and the instructions to be executed. For example for a
- four-node job
- with the job name "cp2k_water_128",
- the account "hpc-prf-ldft",
- the partition "batch",
- a time limit of 10 hours (10:00:00),
- and mail notification to firstname.lastname@example.org, you could use:
#!/bin/bash
#SBATCH -N 4
#SBATCH -J cp2k_water_128
#SBATCH -A hpc-prf-ldft
#SBATCH -p batch
#SBATCH -t 10:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=firstname.lastname@example.org

# run your application here
- This jobscript can be submitted with the command "sbatch" followed by the filename of the jobscript.
- By default, Slurm will use the directory in which sbatch was executed as the working directory.
- Many more options of sbatch can be found in the sbatch manual.
- more details on controlling the task mapping on the hardware can be found in Noctua-Tuning.
Monitoring SLURM Jobs
- While a job is waiting to start, you can use spredict to see its estimated start time.
- While running you can inspect your job with squeue
> squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 batch ...
- for more information you can use scontrol show job <JOBID> from an interactive session to get the job steps
- use sacct to get accounting information about active and completed jobs
- In case of a problem with your job (e.g. no or unexpected output), you can cancel it with scancel <JOBID>
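As a hypothetical sketch of working with such output, the snippet below extracts the IDs of pending jobs, e.g. to pass them to scancel. SQUEUE_OUT stands in for real squeue output; the job IDs and names are invented.

```shell
#!/bin/bash
# Extract the IDs of pending (ST = PD) jobs from squeue-style output.
# The sample text below is invented; on the cluster you would pipe
# real squeue output into awk instead.
SQUEUE_OUT='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 batch cp2k alice PD 0:00 4 (Priority)
12346 batch cp2k alice R 1:02 4 cn-[0001-0004]'
pending_ids=$(echo "$SQUEUE_OUT" | awk 'NR > 1 && $5 == "PD" { print $1 }')
echo "$pending_ids"
```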
For massive file I/O, the Lustre file system must be used ($PC2PFS/<group>).
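A minimal sketch of staging job output on the parallel file system follows; the group directory name is taken from the example account above, and the fallback path only exists so the sketch runs outside the cluster (on Noctua, $PC2PFS is set by the environment):

```shell
#!/bin/bash
# Write job output to the Lustre file system instead of $HOME.
# "$PWD/pfs" is a placeholder fallback for running this sketch off-cluster.
WORKDIR="${PC2PFS:-$PWD/pfs}/hpc-prf-ldft/run1"
mkdir -p "$WORKDIR"
echo "result data" > "$WORKDIR/output.txt"
cat "$WORKDIR/output.txt"
```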