Noctua-Tuning

Processor Architecture

Performance tuning depends on the software and hardware characteristics of the target system(s). One important hardware component is the processor. The Noctua system has Intel Xeon processors, and the installed Intel Programming Environment (loaded via the intel module) comes with some useful tools. The cpuinfo utility prints the processor architecture information.

> cpuinfo -gdcs
Intel(R) processor family information utility, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright (C) 2005-2018 Intel Corporation.  All rights reserved.

=====  Processor composition  =====
Processor name    : Intel(R) Xeon(R) Gold 6148
Packages(sockets) : 2
Cores             : 40
Processors(CPUs)  : 40
Cores per package : 20
Threads per core  : 1
=====  Placement on packages  =====
Package Id.     Core Id.        Processors
0               0,1,2,3,4,8,9,10,11,12,16,17,18,19,20,24,25,26,27,28            0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
1               0,1,2,3,4,8,9,10,11,12,16,17,18,19,20,24,25,26,27,28            20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39

=====  Cache sharing  =====
Cache   Size            Processors
L1      32  KB          no sharing
L2      1   MB          no sharing
L3      27  MB          (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)(20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39)

=====  Processor Signature  =====
 _________ ________ ______ ________ _______ __________
| xFamily | xModel | Type | Family | Model | Stepping |
|_________|________|______|________|_______|__________|
| 00      | 5      | 0    | 6      | 5     | 4        |
|_________|________|______|________|_______|__________|

To list the processor feature flags, call cpuinfo -f on the target system.

Compiler Flags

The table below shows the compiler arguments required for automatic vectorization with AVX-512 using the Intel C/C++/Fortran compilers 17.x and 18.x, together with the corresponding arguments for the GNU and LLVM compilers.


Compiler          Intel Compilers 17.x    Intel Compilers 18.x                  GNU and LLVM
Cross-platform    -xCommon-AVX512         -xCommon-AVX512                       -mfma -mavx512f -mavx512cd
SKL processors    -xCore-AVX512           -xCore-AVX512 -qopt-zmm-usage=high    -march=skylake-avx512
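
As an illustration, compile lines for the Skylake (SKL) nodes could look like the following sketch; the -O3 level, the source file my_code.c, and the output name my_app are placeholders, and the 18.x Intel compiler is assumed for -qopt-zmm-usage=high:

> icc -O3 -xCore-AVX512 -qopt-zmm-usage=high -o my_app my_code.c
> gcc -O3 -march=skylake-avx512 -o my_app my_code.c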

More "Intel Compiler Features and Performance Tips" from the Intel Compiler Lab can be found here:

Hyperthreading

All Noctua1 nodes are configured with hyperthreading switched off. If you have positive experience with using hyperthreading with this processor on other platforms, please let us know.
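
One way to verify this on a node is to check the thread count per core with the document's own cpuinfo tool (a minimal sketch; with hyperthreading off it reports 1 thread per core, as in the output above):

> cpuinfo -g | grep 'Threads per core'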

General note:

Most versions of this processor generation can run two threads in each of their cores, a feature called hyperthreading (the Xeon Gold 6148 supports HT). This means that each thread gets only half of the core's resources.

The level-1 cache, instruction fetch, decoding, and execution units are shared between the two threads running in the same core, in the same way as in previous processors.

There is no advantage in running two threads per core if any of the shared resources is a limiting factor for the performance. There are so many execution ports and execution units that execution itself is rarely the limiting factor; however, if the code aims at more than two instructions per clock cycle, or if the cache size is a limiting factor, then there is no advantage in running two threads in each core. There is no way to give one thread higher priority than the other in the CPU.

Cache and Memory Access

There is a level-1 data cache and a level-1 code cache on each core. These caches are shared between threads running in the same core. There is also a local private unified level-2 cache for data and code. Additionally, there is a distributed level-3 cache shared between all cores.

The 256-bit or 512-bit read and write bandwidth makes it advantageous to use YMM or ZMM registers for copying or zeroing large blocks of memory. The REP MOVS instruction has full efficiency only if the source and destination are both aligned by 32. In all other cases, it is better to use a function library that uses YMM or ZMM registers.

The phenomenon of cache bank conflicts has been a performance problem in some previous processors. This problem has been removed now. It is always possible to do two cache reads in the same clock cycle without causing a cache bank conflict. However, the problem with false dependence between memory addresses with the same set and offset remains. It is not possible to read and write simultaneously from addresses that are spaced by a multiple of 4 Kbytes.

The theoretical maximum throughput is two cache reads and one write per clock cycle. However, this throughput cannot be maintained continuously because of the limited number of cache ways, read and write buffers, etc. Some of the memory writes may use port 2 or 3 for address calculation rather than port 7, thereby delaying a read.


Job Task Mapping - Setting and Controlling Affinity

Use srun to start a job with SLURM.

> srun -n <number_of_tasks>

Binding can be completely removed:

--cpu_bind=none
  • all processes and their child processes and threads are allowed to migrate across cores as determined by the standard Linux process scheduler.
  • this is useful where processes spawn many short-lived children or over-subscribe the node
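
A corresponding call could look like this sketch (./my_app is a placeholder for the application):

> srun -n <nt> --cpu_bind=none ./my_app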

Bind tasks in "block mode"

> srun -n <nt> --cpu_bind=rank
  • binds task i to core (i modulo 40)
    • 2 processors (sockets) with 20 cores (CPUs) each (socket 0 = cores 0 .. 19; socket 1 = cores 20 .. 39)
  • the mapping is the same for all nodes
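
To inspect the resulting placement, the binding type can be combined with the verbose option, which reports the chosen binding for each task (sketch; ./my_app is a placeholder):

> srun -n 40 --cpu_bind=verbose,rank ./my_app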

Users can bind tasks explicitly to CPUs

> srun -n <nt> --cpu_bind=map_cpu:0,3,5,9
  • tasks are mapped in a round-robin fashion onto the listed cores
  • the mapping is the same for all nodes
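
For example, four tasks with the map above would be bound to cores 0, 3, 5, and 9, respectively; with more tasks, the list is reused from the beginning (sketch; ./my_app is a placeholder):

> srun -n 4 --cpu_bind=verbose,map_cpu:0,3,5,9 ./my_app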

The number of tasks per socket can be limited

> srun -n <nt> --ntasks-per-socket=<ntps>
  • Places tasks on the sockets of a node in a round-robin fashion.
  • The tasks are allowed to migrate across the cores of their socket (but not onto the cores of the other socket).
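
For example, to distribute 20 tasks evenly over the two sockets of a node (sketch; ./my_app is a placeholder):

> srun -n 20 --ntasks-per-socket=10 ./my_app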

Do not share allocated nodes

--exclusive

Distribute tasks (= ranks = processes) across the allocated nodes according to the requested pattern

--distribution=<block|cyclic|arbitrary|plane=<options>>[:block|cyclic]

Bind tasks to memory

--mem_bind=[{quiet,verbose},{local,none}]
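
These options can be combined; a complete call might look like the following sketch (the concrete numbers and ./my_app are placeholders):

> srun -n 40 --exclusive --cpu_bind=rank --mem_bind=verbose,local --distribution=block ./my_app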


Intel MPI

The algorithms used for collective operations can be controlled via the I_MPI_ADJUST environment variable family:
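
A sketch of setting such a variable before the MPI run (the algorithm number 2 and the application name ./my_mpi_app are placeholders; see the reference below for the available variables and algorithm values):

> export I_MPI_ADJUST_ALLREDUCE=2
> srun -n 40 ./my_mpi_app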

Reference:

https://software.intel.com/en-us/mpi-developer-reference-linux-i-mpi-adjust-family


Lustre File System

A main factor behind the high performance of Lustre is its ability to stripe data over multiple Object Storage Targets (OSTs). The Lustre file system of Noctua has four OSTs. The striping information for a given file or set of files can be displayed:

> lfs getstripe <file_name>

The stripe count, stripe size, and stripe offset can be set at the file system, directory, or file level.

For a file-per-process pattern (n-to-n), a stripe count of 1 is often a good setting.

> lfs setstripe -c 1 -i -1 -S 4m <dir_name>

For files larger than 16 TB the stripe count should be higher than 1.

For shared files (n-to-1) the stripe count should cover all (four) OSTs.

> lfs setstripe -c -1 ...
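
A complete call for a directory that will hold shared files could look like this sketch (the 4 MB stripe size and -i -1 follow the example above; <shared_dir_name> is a placeholder):

> lfs setstripe -c -1 -i -1 -S 4m <shared_dir_name>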

Rules of thumb

  • Many clients and many files: Do NOT stripe.
  • Many clients, one file: Do stripe.
  • Some clients and few large files: Do stripe.
  • Always allow the system to choose the OSTs at random.

Data transfer (read/write) considerations

  • try to write in large chunks (1 MB)
  • watch out for small file headers
  • there is no penalty for sparse files, so use them if necessary

Lustre caches data

  • clients can cache data (32 MB)
  • but colliding accesses may flush data from the cache
  • optionally use O_DIRECT (and asynchronous I/O), but beware of alignment and size restrictions

Asynchronous I/O

  • use operations provided by MPI I/O (or other high-level libraries such as HDF5, NetCDF, ...)
    • portable solution
    • supports collective operations
  • add I/O servers to the application
    • dedicated processes that perform the time-consuming operations

But keep in mind

  • I/O is a shared resource, so expect timing variations
  • you are not alone; different users interfere with each other


References:

http://www.nersc.gov/users/storage-and-file-systems/i-o-resources-for-scientific-applications/optimizing-io-performance-for-lustre/

https://www.nas.nasa.gov/hecc/support/kb/lustre-best-practices_226.html

https://www.nics.tennessee.edu/computing-resources/file-systems/io-lustre-tips