Performance tuning depends on software and hardware characteristics of the target system(s). One important hardware component is the processor. The Noctua system has Intel Xeon processors and the installed Intel Programming Environment comes with some useful tools (load module intel). The cpuinfo utility prints out the processor architecture information.
> cpuinfo -gdcs
Intel(R) processor family information utility, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright (C) 2005-2018 Intel Corporation.  All rights reserved.

=====  Processor composition  =====
Processor name    : Intel(R) Xeon(R) Gold 6148
Packages(sockets) : 2
Cores             : 40
Processors(CPUs)  : 40
Cores per package : 20
Threads per core  : 1

=====  Placement on packages  =====
Package Id.  Core Id.                                              Processors
0            0,1,2,3,4,8,9,10,11,12,16,17,18,19,20,24,25,26,27,28  0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
1            0,1,2,3,4,8,9,10,11,12,16,17,18,19,20,24,25,26,27,28  20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39

=====  Cache sharing  =====
Cache  Size   Processors
L1     32 KB  no sharing
L2     1 MB   no sharing
L3     27 MB  (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)(20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39)

=====  Processor Signature  =====
 _________ ________ ______ ________ _______ __________
| xFamily | xModel | Type | Family | Model | Stepping |
|_________|________|______|________|_______|__________|
|   00    |   5    |  0   |   6    |   5   |    4     |
|_________|________|______|________|_______|__________|
To get more information about the processor's feature flags, call cpuinfo -f on the target system.
The table below shows the compiler arguments required for automatic vectorization with AVX-512 using Intel C/C++/FORTRAN compilers 17/18 and the corresponding arguments for GNU and LLVM compilers.
|Target||Intel Compilers 17.x||Intel Compilers 18.x||GNU and LLVM|
|Cross-platform||-xCommon-AVX512||-xCommon-AVX512||-mfma -mavx512f -mavx512cd|
|SKL processors||-xCore-AVX512||-xCore-AVX512 -qopt-zmm-usage=high||-march=skylake-avx512|
More "Intel Compiler Features and Performance Tips" from the Intel Compiler Lab can be found here:
- on Noctua in $PC2SW/doc/IntelCompiler
All Noctua1 nodes are configured with hyperthreading switched off. If you have positive experience using hyperthreading with this processor on other platforms, please send us a notice.
Most processors of this generation can run two threads in each core, a feature called hyperthreading (the Xeon Gold 6148 supports HT). When two threads run on a core, each thread gets only half of the shared resources.
Level-1 cache, instruction fetch, decoding and execution units are shared between two threads running in the same core in the same way as in previous processors.
There is no advantage to running two threads per core if any of the shared resources is a limiting factor for performance. The core has so many execution ports and execution units that execution throughput is rarely the limiting factor. However, if the code aims at more than two instructions per clock cycle, or if cache size is a limiting factor, then there is no advantage in running two threads per core. There is no way to give one thread higher priority than the other in the CPU.
Cache and Memory Access
There is a level-1 data cache and a level-1 code cache on each core. These caches are shared between threads running in the same core. There is also a local private unified level-2 cache for data and code. Additionally, there is a distributed level-3 cache shared between all cores.
The 256-bit or 512-bit read and write bandwidth makes it advantageous to use YMM or ZMM registers for copying or zeroing large blocks of memory. The REP MOVS instruction has full efficiency only if the source and destination are both aligned by 32. In all other cases, it is better to use a function library that uses YMM or ZMM registers.
The phenomenon of cache bank conflicts has been a performance problem in some previous processors. This problem has been removed now. It is always possible to do two cache reads in the same clock cycle without causing a cache bank conflict. However, the problem with false dependence between memory addresses with the same set and offset remains. It is not possible to read and write simultaneously from addresses that are spaced by a multiple of 4 Kbytes.
The theoretical maximum throughput is two cache reads and one write per clock cycle. However, this throughput cannot be maintained continuously because of limited cache ways, read and write buffers, etc. Some of the memory writes may use port 2 or 3 for address calculation, rather than port 7, and thereby delaying a read.
Job Task Mapping - Setting and Controlling Affinity
Use srun to start a job with SLURM.
> srun -n <number_of_tasks>
Binding can be completely removed:
- all processes and their child processes and threads are allowed to migrate across cores as determined by the standard Linux process scheduler
- this is useful where processes spawn many short-lived children or over-subscribe the node
Bind tasks in "block mode"
> srun -n <nt> --cpu_bind=rank
- binds task i to core (i modulo 40)
- 2 processors (sockets) with 20 cores (CPUs) each (processor 0 = cores 0 .. 19; processor 1 = cores 20 .. 39)
- the mapping is the same for all nodes
Users can bind tasks explicitly to CPUs
> srun -n <nt> --cpu_bind=map_cpu:0,3,5,9
- tasks are mapped in a round robin way on the defined cores
- the mapping is the same for all nodes
The number of tasks per socket can be limited
> srun -n <nt> --ntasks-per-socket=<ntps>
- Places tasks on the sockets of a node in a round robin way.
- The tasks are allowed to migrate across the cores of their socket (but not to the cores of the other socket).
Further srun options allow you to:
- not share allocated nodes (--exclusive)
- distribute tasks (= ranks = processes) across the allocated nodes according to the requested pattern (--distribution)
- bind tasks to memory (--mem-bind)
- control the algorithms used for MPI collective operations
Lustre File System
A main factor leading to the high performance of Lustre is the ability to stripe data over multiple Object Storage Targets (OSTs). The Lustre File System of Noctua has four OSTs. The striping information for a given file or files can be shown.
> lfs getstripe <file_name>
The stripe count, stripe size, and stripe offset can be set on a file system, directory, or file level.
For a file per process pattern (n-to-n) a stripe count of 1 is often a good setting.
> lfs setstripe -c 1 -i -1 -S 4m <dir_name>
For files larger than 16 TB the stripe count should be higher than 1.
For shared files (n-to-1) the stripe count should cover all (four) OSTs.
> lfs setstripe -c -1 ...
Rules of thumb
- Many clients and many files: Do NOT stripe.
- Many clients, one file: Do stripe.
- Some clients and few large files: Do stripe.
- Always allow the system to choose OSTs at random.
Data transfer (read/write) considerations
- try to write large chunks (~1 MB)
- watch out for small file headers
- there is no penalty for sparse files, so use them if necessary
Lustre caches data
- clients can cache data (32 MB)
- but colliding accesses may flush data from the cache
- optionally use O_DIRECT (and async I/O) but beware alignment and size restrictions
- use operations provided by MPI I/O (or other high-level libraries: HDF5, NetCDF, ...)
- portable solution
- supports collective operations
- add I/O servers to application
- dedicated processes to perform time consuming operations
But keep in mind
- I/O is a shared resource, so expect timing variation
- you are not alone; different users interfere with each other