OCuLUS-to-Noctua1


OCuLUS has now been in operation for more than 8 years and, as you may know, PC² is procuring a new cluster system (Noctua phase 2). As a consequence, Noctua will be renamed to Noctua1 and will replace OCuLUS as the Tier3 system. Hence, it is time to say good-bye to OCuLUS and hello to Noctua1. The migration itself starts at the beginning of Q2/2022; see also the Roadmap section below. This document should help you migrate your project workflow to Noctua1. It documents the changes and steps to consider.

If you plan to migrate your project (completely or in parts) to Noctua2, please be patient; we are currently working on a similar document for that migration.

Updates to the Migration Procedures

25.11.21

  • We will provide migration workshops in February 2022

22.10.21

  • If you plan to migrate your accepted or already running OCuLUS project (completely or in parts) to Noctua2 within the remaining time of the project, you may send us an informal application. After the end of the project runtime, you have to apply as usual for a follow-up or new project.

OCuLUS vs Noctua1

The following overview summarizes the major differences between OCuLUS and Noctua1 (previously named Noctua):

  • Inauguration: OCuLUS 2013; Noctua1 2018
  • Peak Performance: OCuLUS 220 TFlop/s; Noctua1 700 TFlop/s
  • Status 2022: OCuLUS switched off; Noctua1 Tier3 system
  • Compute Nodes
    • OCuLUS: 612 nodes, each with 2x Intel Xeon E5-2670 (16 cores per node)
      • 512 nodes with 64 GiB RAM
      • 20 nodes with 256 GiB RAM
      • 4 nodes with 1 TB RAM
      • 52 nodes with 64 GiB RAM + GPUs
    • Noctua1: 274 nodes, each with 2x Intel Xeon Gold “Skylake” 6148 (2.4 GHz, 40 cores per node)
      • 256 nodes with 192 GiB RAM
      • 16 nodes with 192 GiB RAM and NVidia GTX/RTX GPUs
      • 4 nodes with 386 GiB RAM and NVidia GTX/RTX GPUs
  • Number of CPU cores: OCuLUS 9,920; Noctua1 10,880
  • Interconnect: OCuLUS InfiniBand QDR 40 Gb/s; Noctua1 Omni-Path 100 Gb/s
  • Storage: OCuLUS 480 TB BeeGFS; Noctua1 720 TB Lustre
  • Frontends: OCuLUS 2; Noctua1 2
  • WLM: OCuLUS OpenCCS; Noctua1 Slurm
    Until now, the Noctua1 nodes have been scheduled in exclusive mode (only one job can use a node). This will be changed to shared mode, allowing several jobs to run on a single node, as on OCuLUS. Exclusive node access will still be possible.
  • Operating System: OCuLUS Scientific Linux 7.2; Noctua1 RHEL 8
  • Jump-Host: OCuLUS none; Noctua1 is accessed via a jump host acting as a load balancer. Refer to: Jumphost
  • PC2PFS Gateway nodes: OCuLUS none; the Noctua1 parallel file system is exported via NFS or CIFS. Refer to: Gateways

Migration Procedures

We are happy to assist you, so do not hesitate to send an email to pc2-support@uni-paderborn.de if you need help. We will then make an appointment to discuss your project-specific migration issues. The following points should be considered.

Self-Compiled SW

We recommend re-compiling your SW on Noctua1 to achieve full performance on the Intel Xeon Gold 6148 CPUs.
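
For example, a re-compile targeting the Skylake architecture of these CPUs could look like this (a minimal sketch; the module name GCC and the source file myprog.c are placeholders, check module avail for the toolchains actually installed on Noctua1):

# Load a current compiler toolchain (module name is a placeholder).
module load GCC
# -march=skylake-avx512 targets the AVX-512 units of the Xeon Gold 6148 ("Skylake") CPUs.
gcc -O3 -march=skylake-avx512 -o myprog myprog.c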

Commercial-SW Packages and Executables

We recommend re-installing your SW on Noctua1 to achieve full performance on the Intel Xeon Gold 6148 CPUs.

OCuLUS $PC2PFS data (aka /scratch)

For a limited period (2-3 months), we will mount the OCuLUS parallel file system on the Noctua1 frontends in read-only mode. This allows you to copy all necessary data from OCuLUS ($PC2PFS) to Noctua1 ($PC2PFS).
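
A minimal sketch of such a copy, assuming the read-only OCuLUS file system is reachable on the Noctua1 frontends under a path announced by PC² (here the placeholder /scratch_oculus) and that $PC2PFS points to your Noctua1 scratch directory:

# /scratch_oculus and myproject are placeholders; use the announced mount point and your own directory.
rsync -a --progress /scratch_oculus/myproject/ $PC2PFS/myproject/

The same pattern applies to data you still need from $PC2SCRATCH (see the next section).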

$PC2SCRATCH data

Since Noctua1 and Noctua2 both export their $PC2PFS file systems, there is no need for the $PC2SCRATCH file system anymore. For a limited period (2-3 months), we will mount the $PC2SCRATCH file system on the Noctua1 frontends in read-only mode. This allows you to copy all necessary data to the Noctua1 $PC2PFS.

Migrating OpenCCS Job Scripts to Slurm Job Scripts

On OCuLUS, we provide the script ccs2slurm which converts OpenCCS job scripts to Slurm job scripts. Try ccs2slurm -h for more detailed information on how to use this script.

SLURM

The following explanations and examples may be helpful while migrating your workflows from OpenCCS to Slurm.

Initial Slurm Specific Remarks

  • First line of a job script in Slurm must be #! <shell> . For example: #! /bin/sh.
  • Slurm jobs start in the submission directory rather than $HOME.
  • Slurm jobs have stdout and stderr output log files combined by default.
  • If the submission script is not a file in the current directory, Slurm will search in your $PATH.
  • Jobs inherit all your current environment variables unless you specify --export=NONE.
  • For more information about the job's environment set by Slurm, see “OUTPUT ENVIRONMENT VARIABLES” in the sbatch man page.
  • Slurm can send emails when your job reaches a certain percentage of its walltime limit:
$ sbatch --mail-type=TIME_LIMIT_90 myjob.sub
$ sbatch --mail-type=TIME_LIMIT_80 myjob.sub
$ sbatch --mail-type=TIME_LIMIT_50 myjob.sub
  • You can also use srun to run MPI programs for more control, but you will need to tell Slurm which type of MPI to use (--mpi=openmpi)
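
Putting some of these remarks together, a minimal job script could look like this (a sketch only; the job name, time limit, and command are just examples):

#! /bin/bash
#SBATCH --job-name=myjob
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --output=myjob_%j.log
#SBATCH --mail-type=TIME_LIMIT_90

# stdout and stderr both end up in myjob_<jobid>.log (combined by default).
# Slurm starts the job in the submission directory, not in $HOME:
echo "Running in $(pwd) on $(hostname)"

Submit it with sbatch and check its state with squeue -u <userid>.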

Terms

  • Task
    • In the Slurm context, a task is to be understood as a process. So a multi-process program is made of several tasks. By contrast, a multithreaded program is composed of only one task, which uses several CPUs.
    • Tasks are requested/created with the --ntasks option, while CPUs, for the multithreaded programs, are requested with the --cpus-per-task option.
    • Tasks cannot be split across several compute nodes, so requesting several CPUs with the --cpus-per-task option will ensure all CPUs are allocated on the same compute node.
    • By contrast, requesting the same amount of CPUs with the --ntasks option may lead to several CPUs being allocated on several, distinct compute nodes.
      • If you want to separate the output by task, you will need to build a script containing this specification. For example:
    $ cat test
    #!/bin/sh
    echo begin_test
    srun -o out_%j_%t hostname
    $ sbatch -n7 -o out_%j test
    sbatch: Submitted batch job 65541
    $ ls -l out*
    -rw-rw-r--  1 jette jette 11 Jun 15 09:15 out_65541
    -rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_0
    -rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_1
    -rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_2
    -rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_3
    -rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_4
    -rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_5
    -rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_6    
  • Job Step
    • A set of (possibly parallel) tasks within a job.
    • Job steps will use allocated nodes that are not already allocated to other job steps. This essentially provides a second level of resource management within the job for the job steps.
    • Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. For instance, a single job step may be started that utilizes all nodes allocated to the job, or several job steps may independently use a portion of the allocation.
    • Instead of submitting many single-node jobs (known as farming), it is suggested to do the farming using job steps inside a single job (see the Examples section below).
  • CPU/Socket/Core/Thread
    • A Noctua1 node has
      • 2 sockets (i.e. physical CPUs)
      • 20 cores per socket
      • 1 thread per core
      • In summary 40 CPUs (= 2 sockets * 20 cores * 1 thread)
    • If nodes are configured with hyperthreading, then a CPU is equivalent to a hyperthread. Otherwise a CPU is equivalent to a core. You can determine if your nodes have more than one thread per core using the command scontrol show node and looking at the values of ThreadsPerCore.
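
For example, a multithreaded (OpenMP) program that needs 8 CPUs on a single node would be requested with one task and --cpus-per-task (a minimal sketch; my_threaded_prog is a placeholder), whereas --ntasks=8 alone could spread the 8 CPUs over several nodes:

#! /bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# One task with 8 CPUs, all guaranteed to be on the same compute node.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_prog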

OpenCCS and Slurm Command Comparison Overview

  • Job submission: OpenCCS ccsalloc; Slurm salloc, sbatch, srun
  • Job altering: OpenCCS ccsalter; Slurm scontrol
  • Job killing: OpenCCS ccskill; Slurm scancel
  • Getting information: OpenCCS ccsinfo; Slurm scontrol, sinfo, spredict, sprio, squeue, sreport, sshare, sstat
  • Inspect accounting/log data: OpenCCS ccstracejob; Slurm sacct
  • Dashboard: OpenCCS ccsdashboard; Slurm scluster
  • Send message to job log files: OpenCCS ccsmsg; no Slurm equivalent
  • Send signals to a job: OpenCCS ccssignal; no Slurm equivalent
  • Reattach (rebind) to an interactive job: OpenCCS ccsbind; Slurm sattach
  • High-level job scripts: OpenCCS ccsworker; no Slurm equivalent
  • Graphical UI: OpenCCS none; Slurm sview, smap


Examples

OpenCCS required you to specify the number of chunks and the number of CPUs in each chunk.
In Slurm, you may specify

  • the number of nodes (-N or --nodes)
  • the total number of tasks (-n or --ntasks)
  • and/or the number of tasks on each node (--ntasks-per-node)

So: TotalTasks = Nodes · TasksPerNode

In the following overview, we give some examples of how to convert OpenCCS resource specifications to Slurm.

  • Request X cores and do not care where they are: OpenCCS -c X or X:ncpus=1; Slurm -n X or --ntasks=X
  • Request 8 nodes with 4 cores and 32GB memory each: OpenCCS 8:ncpus=4:mem=32G; Slurm --nodes=8 --ntasks-per-node=4 --mem=32G
  • Request X nodes exclusively: OpenCCS -n X or --nodes=X; Slurm -N X or --nodes=X
  • Request 4 nodes with 4 cores and 32GB memory each in exclusive mode: OpenCCS 4:ncpus=4:mem=32G,place=:excl; Slurm --nodes=4 --ntasks-per-node=4 --mem=32G --exclusive
  • Number of (MPI) processes per node: OpenCCS mpiprocs=X; Slurm --ntasks-per-node=X
  • Request one core with X threads: OpenCCS ncpus=1:ompthreads=X; Slurm -n 1 and export OMP_NUM_THREADS=X
  • Print brief job/queue info: OpenCCS ccsinfo -s [jobid]; Slurm squeue
  • Print all my jobs: OpenCCS ccsinfo -s --mine; Slurm squeue -u <userid>
  • Print detailed job information: OpenCCS ccsinfo jobid; Slurm scontrol show job jobid, sstat -j jobid, or sacct -X -u <userid>
  • Get node list: OpenCCS $(cat $CCS_NODEFILE); Slurm LIST=$(srun hostname)
  • Get accounting data: OpenCCS ccstracejob jobid; Slurm sacct -j jobid
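
As an illustration, the request of 8 nodes with 4 tasks and 32GB memory each, written as a complete Slurm job script, could look like this (a minimal sketch; myprog is a placeholder for your MPI executable):

#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --mem=32G
#SBATCH --time=01:00:00

# 8 nodes, 4 (MPI) tasks per node, 32 GB memory per node, 1 hour walltime.
module load OpenMPI
srun ./myprog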

Using srun

Use srun within your sbatch job script for fine-grained control over where the commands in the script run. Here we see two separate commands, each running on 10 of the 20 allocated cores:

#!/bin/bash
#SBATCH -N 2 -n 20
srun -n 10 myfirstcommand
srun -n 10 mysecondcommand

Farming Jobs

If you have multiple independent serial tasks, you can pack them together into a single Slurm job. This is suitable for simple task farming. The following job submission script asks Slurm for 8 CPUs and then runs the myprog program 1000 times with arguments from 1 to 1000. The -n1 --exclusive option ensures that at any point in time only 8 instances are effectively running, each allocated one CPU.

#! /bin/bash
#
#SBATCH --ntasks=8
for i in {1..1000}
do
   srun -n1 --exclusive ./myprog $i &
done
wait

The for loop can be replaced with GNU parallel if it is installed on your system: parallel -P $SLURM_NTASKS srun -n1 --exclusive ./myprog ::: {1..1000}

Refer also to: https://github.com/paddydoyle/staskfarm or https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html

Master/Worker Program Example

#!/bin/bash
#
#SBATCH --job-name=test_ms
#SBATCH --output=res_ms.txt
#
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun --multi-prog multi.conf

With file multi.conf being, for example, as follows:

0      echo I am the Master
1-3    echo I am worker %t

The above instructs Slurm to create four tasks (processes): one running echo I am the Master, and the other three running echo I am worker %t. The %t placeholder is replaced with the task id. This is typically used in a producer/consumer setup where one program (the master) creates computing tasks for the other programs (the workers) to perform.

Upon completion of the above job, the file res_ms.txt will contain the following lines (not necessarily in this order):

I am worker 2
I am worker 3
I am worker 1
I am the Master

Hybrid jobs

You can mix multi-processing (MPI) and multi-threading (OpenMP) in the same job, simply like this:

#! /bin/bash
#
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
module load OpenMPI
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./myprog

Requesting GPUs

You can request GPUs by giving the total number of GPUs needed (--gpus), the number of GPUs per node (--gpus-per-node), or the number per task (--gpus-per-task).

  • Request 2 nodes with 2 GPUs each
    #SBATCH --nodes=2
    #SBATCH --gpus-per-node=2
  • Request one node with use of 6 CPU cores and 1 GPU:
    #SBATCH --ntasks=6
    #SBATCH --gpus-per-node=1
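
Putting one of the requests above into a complete job script could look like this (a minimal sketch; the partition name gpu and the program my_gpu_prog are assumptions, check sinfo for the partitions actually configured on Noctua1):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --ntasks=6
#SBATCH --gpus-per-node=1
#SBATCH --time=00:30:00

# Show the GPU assigned to this job, then run the (placeholder) program on it.
nvidia-smi
srun ./my_gpu_prog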

(Alternative) GPU Resource Specification

If one wants to access the GPU devices on a node, one must specify the generic consumable resources flag (a.k.a. gres flag). The gres flag has the following syntax: --gres=$(resource_type)[:$(resource_name):$(resource_count)] where:

  • $(resource_type) is always equal to gpu string for the GPU devices.
  • $(resource_name) is a string which describes the type of the requested gpu(s) e.g., gtx1080, rtx2080, ....
  • $(resource_count) is the number of GPU devices of the type $(resource_name) that are requested.

Its value is an integer in the closed interval from 1 to the maximum number of devices on one node. The square brackets [ ] denote optional parameters, that is, to request any single GPU regardless of type (the default count is 1), --gres=gpu will work. To request more than one GPU of any type, one can add the $(resource_count), e.g. --gres=gpu:2.

For example, the flag --gres=gpu:rtx2080:2 requests 2 RTX 2080 devices.

Note that if you do not specify the gres flag, your job will run on a GPU node (presuming you use the correct combination of the --partition and --account flags), but it will not have access to the node's GPUs.


Roadmap

Depending on the availability of Noctua2 (tentatively planned for the beginning of Q2 2022), the following major steps are currently planned:

  • Migrate projects from Noctua1 to Noctua2
  • Open Noctua2 and close Noctua1 for maintenance. During this one-week maintenance, we will
    • Install missing SW-packages
    • Prepare Noctua1 filesystems for the OCuLUS-projects
    • Migrate GPUs from OCuLUS to Noctua1
    • Re-configure Slurm
    • Mount OCuLUS PC2PFS in read-only mode on the Noctua1 frontends
    • Mount PC2SCRATCH in read-only mode on the Noctua1 frontends
  • Open Noctua1 and close OCuLUS
  • About 2-3 months later, the mounts of the OCuLUS PC2PFS and PC2SCRATCH on the Noctua1 frontends will be removed.

If you have questions or comments, please send an email to pc2-support@uni-paderborn.de.