Slurm

Workload Manager - SLURM

The Noctua1 system uses the Simple Linux Utility for Resource Management (SLURM).

  • Plenty of documentation can be found at https://slurm.schedmd.com/documentation.html
  • The commands for daily work are:
    • sbatch - submit a batch script to SLURM
    • srun - run a parallel job
    • scancel - signal or cancel a job under the control of SLURM
    • squeue - show information about your queued and running jobs
    • sinfo - show information about the partitions and nodes
    • scluster - show information about currently allocated, free and offlined nodes
  • All information about your program execution is contained in a batch script, which is submitted via sbatch.
  • The batch script contains one or more parallel job runs executed via srun (job steps); a minimal sketch follows after this list.
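
As a minimal sketch, a batch script with a single srun job step could look like the following (the executable ./my_app, the node count and the account placeholder are illustrative and not actual names on Noctua):

#!/bin/bash
# request two complete nodes for 30 minutes in the default partition
#SBATCH -N 2
#SBATCH -t 00:30:00
#SBATCH -A <your-project>
#SBATCH -p batch

# one job step: srun launches the (hypothetical) application on the allocated nodes
srun ./my_app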

SLURM Configuration on Noctua

Changelog

  • 29.05.2019: we changed the monthly windows to sliding 30-day windows

Basic Configuration

  • Slurm runs in private mode:
    • squeue will only show your own jobs. Please use scluster if you are interested in the number of currently free, used, drained or offline compute nodes.
    • Accounting information (sacct) is only returned for your own jobs or for jobs of projects for which you are a Slurm coordinator.
  • Nodes are assigned exclusively to a user (no node sharing). That is, you can only request complete nodes (with two cpus and 20 cores per cpu). Even if you use only one cpu-core of a node, your project is billed for all 40 cpu-cores.
  • Hyperthreading is disabled on the compute nodes, i.e. the 40 cpu-cores per node are physical cpu-cores without simultaneous multithreading.
  • Requeuing, rebooting of nodes (except FPGA nodes), use of default accounts and submission to multiple partitions are not allowed.
  • The accounting is based on allocated cpu-core hours. Allocating one node for one hour costs your project 40 cpu-core hours (see the worked example after this list). FPGA-hours are presently not counted.
  • Backfilling is enabled. That is, if your job has a lower priority than someone else's job, your job can still start if it would not delay the start time of the other person's job. That means you can fill in the gaps left by larger jobs. Hence, it is important to set a realistic time limit for your job.
  • You can submit and start jobs even if you have exceeded your 30-day contingent, your 90-day contingent or even your total contingent. The only consequence of exceeding these contingents is a reduction of your job priority.
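
As an illustration of the billing rule above (the job size and duration are made up): a job that allocates 4 nodes for 10 hours is billed

  4 nodes * 40 cpu-cores * 10 hours = 1,600 cpu-core hours

regardless of how many cpu-cores the application actually uses.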

Partitions

partition name     max. walltime            default walltime     max. nodes   default nodes   nodes                             purpose               priority_factor
all                12 hours (12:00:00)      4 hours (04:00:00)   272          1               cn-[0001-0256],fpga-[0001-0016]   system-benchmarking   1.0
batch (default)    12 hours (12:00:00)      4 hours (04:00:00)   128          1               cn-[0001-0251]                    normal jobs           0.0
long               21 days (21-00:00:00)    4 hours (04:00:00)   64           1               cn-[0001-0251]                    long-running jobs     0.0
short              30 min. (00:30:00)       5 min. (00:05:00)    2            1               cn-[0002-0256]                    compile / test jobs   0.0
test               30 min. (00:30:00)       5 min. (00:05:00)    64           1               cn-[0001-0251]                    sniff user            0.0
fpga               2 hours (02:00:00)       20 min. (00:20:00)   16           1               fpga-[0001-0016]                  FPGA jobs             0.0

You can list the partitions you are allowed to use with pc2status. Slurm project coordinators can set additional limits for individual users in their project.
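
For example, to submit a job script to the long partition with a two-day time limit, you could set the partition and walltime on the sbatch command line (assuming your project has access to the long partition; job.sh is a placeholder for your own batch script):

> sbatch -p long -t 2-00:00:00 job.sh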

Priorities

The scheduling of jobs on Noctua is based on job priorities. The priority is calculated as:

  Job_priority = 400,000 * (QoS_factor) + 50,000 * (partition_factor) + 35,000 * (age_factor) + 15,000 * (job_size_factor)

The QoS_factor is detailed below (a worked priority example follows the QoS_factor rules). The partition_factor is listed in the table above. The age_factor starts at zero when a job is submitted and grows linearly to reach a value of one after 10 days. The job_size_factor favors large jobs.

  • The QoS_factor depends on the cpu-core hours used by your project in the last 30 days relative to its contingents.
  • There are two contingents:
    • 30-day contingent M30 = (total contingent) / (days in project runtime) * 30
    • 90-day contingent window M90 = (M30) + (half of the leftover M30 contingent of a reference day 30 days ago) + (half of the M30 contingent of the next 30 days). The used contingent is not subtracted from your next M30 contingent. A worked example follows after this list.
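
As an illustration with made-up numbers: a project with a total contingent of 360,000 cpu-core hours and a runtime of 360 days has a 30-day contingent of

  M30 = 360,000 / 360 * 30 = 30,000 cpu-core hours

If the full M30 contingent of the reference day 30 days ago is still left over, the 90-day contingent window is

  M90 = 30,000 + 15,000 + 15,000 = 60,000 cpu-core hours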

QoS_factor rules:

  • used cpu-core hours lower than M30: QoS_factor=0.75 (normal)
  • used cpu-core hours larger than M30 but lower than M90: QoS_factor=0.5 (lowcont)
  • used cpu-core hours larger than M90: QoS_factor=0.25 (nocont)
  • No contingent (M30=0, M90=0): QoS_factor=0 (suspended). Project is suspended and no jobs can be submitted.
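
For illustration with made-up job parameters: a job in the batch partition (partition_factor 0.0) of a project at the normal QoS level (QoS_factor 0.75) that has been waiting for 5 days (age_factor 0.5) and has a negligible job_size_factor gets a priority of roughly

  400,000 * 0.75 + 50,000 * 0.0 + 35,000 * 0.5 + 15,000 * 0.0 = 317,500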

You can use pc2status to view the current contingents and QoS-level of your projects. sprio gives you the computed priorities of your pending jobs.
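
For example, to list the priority components of your own pending jobs:

> sprio -u $USER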

Submitting Jobs

Jobs are submitted with the command line tool sbatch. The main command line argument of sbatch is a shell script that describes the parameters of the job and the instructions to be executed. For example, for a

  • four-node job
  • with the job name "cp2k_water_128",
  • the account "hpc-prf-ldft",
  • the partition "batch",
  • a time limit of 10 hours (10:00:00),
  • and mail notification to test@example.com, you could use:
#!/bin/bash
#SBATCH -N 4
#SBATCH -J cp2k_water_128
#SBATCH -A hpc-prf-ldft
#SBATCH -p batch
#SBATCH -t 10:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=test@example.com

#run your application here

  • By default, Slurm will use the directory in which sbatch was executed as the working directory.
  • Many more options of sbatch can be found in the sbatch manual.
  • More details on controlling the task mapping on the hardware can be found in Noctua-Tuning.
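
Assuming the script above is saved as cp2k_water_128.sh (the file name is arbitrary), it is submitted with

> sbatch cp2k_water_128.sh
Submitted batch job 12345

sbatch prints the job ID, which can then be used with squeue, scontrol, sacct and scancel as described below.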


Monitoring SLURM Jobs

  • While a job is waiting to start, you can use spredict to see its estimated start time.
  • While your job is running, you can inspect it with squeue:
> squeue -u $USER
  JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12345   batch         ...  
  • For more information you can use scontrol show job <JOBID> from an interactive session to get the job steps.
  • Use sacct to get accounting information about active and completed jobs (see the example after this list).
  • In case of a problem with your job (e.g. no or unexpected output), you can cancel it with
scancel <JOBID>
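
A possible sacct call for a single job (the selection of format fields is just one reasonable choice):

> sacct -j <JOBID> --format=JobID,JobName,Partition,State,Elapsed,NNodes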

Remark:

For massive file I/O, the LUSTRE file system must be used ($PC2PFS/<group>).
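
For example, a run directory on the parallel file system could be created and used like this (<group> remains a placeholder for your project group):

> mkdir -p $PC2PFS/<group>/my_run
> cd $PC2PFS/<group>/my_run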