Performance tuning can be done at different levels:
Parallel File System ($PC2PFS)
Chunks and Striping
A main factor behind the high performance of BeeGFS is its ability to stripe data across the 6 storage nodes.
The default chunk size is 2M. Files larger than 2M are striped across the 6 storage nodes. Files with a size up to 2M are mapped to a single storage node in a round-robin manner.
The chunk size and striping information for a given file or directory can be shown by using beegfs-ctl.
[user@fe2 ~]$ beegfs-ctl --getentryinfo /scratch/hpc-prf-foo/MYFILE
EntryID: 61-5E460338-37F7
Metadata buddy group: 1
Current primary metadata node: meta01 [ID: 14327]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 2M
+ Number of storage targets: desired: 1; actual: 1
+ Storage targets:
  + 40642 @ storage05 [ID: 19371]
One can change the chunk size and the striping pattern by:
[user@fe2 ~]$ beegfs-ctl --setpattern --chunksize=512k --numtargets=4 /scratch/hpc-prf-foo/MYFILE
New chunksize: 524288
New number of storage targets: 4

[user@fe2 ~]$ beegfs-ctl --getentryinfo /scratch/hpc-prf-foo/MYFILE
EntryID: 61-5E460338-37F7
Metadata buddy group: 1
Current primary metadata node: meta01 [ID: 14327]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 512k
+ Number of storage targets: desired: 4; actual: 1
+ Storage targets:
  + 40642 @ storage05 [ID: 19371]
+ Storage Pool: 1 (Default)
For more information, refer to the beegfs-ctl tool itself, which has a built-in help system.
Migrating Existing Metadata
Since Feb 2020, we have enabled metadata mirroring for the root directory of the BeeGFS (/scratch), as well as for all files contained in it. Note that existing directories (created before Feb 2020), except for the root directory, are not mirrored automatically. If a file is moved into a directory with active metadata mirroring, its metadata will be mirrored. Directories, on the other hand, are not automatically mirrored when moved. For a directory to be mirrored, it must therefore be freshly created inside a mirrored directory. The easiest way to enable mirroring for a whole directory tree is a recursive copy:
$ cp -a <directory> <mirrored-dir>
Since this also copies the file contents, it can be used to enable metadata and storage mirroring at the same time. If you need a higher level of safety for already existing files in $PC2PFS, you should therefore copy them as described above.
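A minimal sketch of this recipe (the directory names are illustrative, and a temporary directory stands in for a mirrored parent on /scratch so that the commands are self-contained):

```shell
# Stand-in for a mirrored parent directory; on the real system this would
# be a directory under /scratch whose metadata mirroring is already active.
BASE=$(mktemp -d)
mkdir -p "$BASE/olddata"                      # pre-existing, unmirrored directory
echo "some data" > "$BASE/olddata/file.txt"   # demo content

# Freshly create the target inside the mirrored parent, then copy recursively.
mkdir "$BASE/olddata.mirrored"                # new directory -> metadata-mirrored
cp -a "$BASE/olddata/." "$BASE/olddata.mirrored/"

# After verifying the copy, replace the original with the mirrored tree.
rm -rf "$BASE/olddata"
mv "$BASE/olddata.mirrored" "$BASE/olddata"
```

The final rm/mv swap keeps the original path name, so scripts referring to the old path keep working.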
[user@fe2 ~]$ beegfs-ctl --help
BeeGFS Command-Line Control Tool (http://www.beegfs.com)
Version: 7.1.4

GENERAL USAGE:
 $ beegfs-ctl --<modename> --help
 $ beegfs-ctl --<modename> [mode_arguments] [client_arguments]

MODES:
 --listnodes   => List registered clients and servers.
 --listtargets => List metadata and storage targets.
 ...

USAGE:
 This is the BeeGFS command-line control tool. Choose a control mode from the
 list above and use the parameter "--help" to show arguments and usage
 examples for that particular mode.

 Example: Show help for mode "--listnodes"
  $ beegfs-ctl --listnodes --help
- PC2PFS is a shared resource; different users interfere with each other.
- The BeeGFS default configuration maps files to the storage nodes in a round-robin manner. Hence, if you only have small files, there is no need to change the striping pattern.
- Asynchronous I/O: use operations provided by MPI I/O (or other high-level libraries such as HDF5, NetCDF, ...).
Source Code Level
- Memory hierarchy optimization (register files, caches, ...)
- Optimal data alignment (address boundaries, blocking, padding, ...)
- Loop optimization, pipelining
- Use already existing optimized libraries (Intel MKL, Atlas, NAG, ...)
- see "Intel MKL Link Line Advisor" https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
- Mind the I/O (MPI-I/O, PnetCDF, ...)
- The Intel and GNU compilers support flags that enable code generation for specific processor types (e.g., Sandy Bridge E5-2600/E5-4600: -march=corei7-avx; Nehalem X5600: -march=corei7).
- Intel compiler: with optimization level -O2 or higher, the compiler looks for vectorization opportunities. Use one of the flags -x <ext>, -m <ext>, -ax <ext>, or -xHost to select the appropriate instruction set extension (AVX, SSE4.2, SSE4.1, ...).
- GNU compiler: with optimization level -O3 or higher, the compiler looks for vectorization opportunities. The flags -msse4.2 or -mavx can be used to select the appropriate instruction set extension.
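As an illustration with GCC, one can check whether the compiler vectorizes a simple loop. The -fopt-info-vec reporting flag and the saxpy example are assumptions not taken from the text above; they serve only as a sketch:

```shell
# Write a small C program with a loop that is a candidate for vectorization.
cat > saxpy.c <<'EOF'
#include <stdio.h>
#define N 1024
int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2 * i; }
    for (int i = 0; i < N; i++)     /* candidate for auto-vectorization */
        y[i] = 2.0f * x[i] + y[i];
    printf("%.1f\n", y[N - 1]);     /* 4*1023 = 4092.0 */
    return 0;
}
EOF

# -O3 enables the vectorizer; -fopt-info-vec reports which loops were
# vectorized. On an AVX-capable node one could add -mavx (or -march=native).
gcc -O3 -fopt-info-vec -o saxpy saxpy.c 2> vec_report.txt
./saxpy   # prints 4092.0
```

Inspecting vec_report.txt shows whether and how the loop was vectorized.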
- Guided auto-parallelization
- The Intel compiler can provide the programmer with information for guided auto-parallelization, auto-vectorization, and data transformation. The flag -guide makes the compiler generate diagnostics that help the user improve the code.
The Workload Manager (OpenCCS) places processes on hardware resources. This placement can be performance-critical, i.e., it can influence the execution time. If a process shares resources with other processes (the L3 cache of a processor, the interface to main memory, the interface to the high-speed communication network, a switch in the communication network, an I/O device, ...), its execution time can increase. Evaluate different process placements to find the optimal performance or efficiency.
The placement of threads on processor cores (pinning) can affect the execution time of an application.
Tools to control the CPU affinity are taskset and likwid-pin.
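For example (taskset is part of util-linux; the application binaries named below are hypothetical):

```shell
# Show the CPU affinity of the current shell.
taskset -cp $$

# Run a command pinned to core 0 (a trivial echo stands in for a real
# application binary such as ./my_app).
taskset -c 0 echo "pinned to core 0"

# likwid-pin offers similar, topology-aware pinning, e.g.:
#   likwid-pin -c 0-3 ./my_omp_app
```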
OpenMP has also some features to place threads to cores (see OpenMP Tuning).
OpenMPI has a built-in mechanism to control the CPU affinity.
OpenMPI Run-time Tuning
Options used to control the CPU affinity are: --bind-to-socket, --bysocket, --bind-to-core, --bycore, --npersocket, and --slot-list
These options must be used with care, especially if nodes are not allocated exclusively (i.e., potentially shared with other users).
The --report-bindings option will show where OpenMPI actually bound the processes.
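Illustrative invocations for a hypothetical MPI binary ./my_app (the option names are those listed above, used by older OpenMPI releases; newer releases spell them --bind-to core / --map-by core):

```shell
mpirun -np 8 --bind-to-core   --bycore   ./my_app   # one process per core
mpirun -np 4 --bind-to-socket --bysocket ./my_app   # distribute over sockets
mpirun -np 8 --bind-to-core --report-bindings ./my_app   # show actual bindings
```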
For further information, search for the keywords "Open MPI run-time tuning".
OpenMP Run-time Tuning
- Number of threads: OMP_NUM_THREADS environment variable
- Thread stack size: OMP_STACKSIZE env. variable
- Iteration scheduling: OMP_SCHEDULE env. variable or the schedule clause on parallel loop constructs
- Thread affinity: OMP_PROC_BIND env. variable
- Intel compiler: KMP_AFFINITY env. variable
- GNU compiler: GOMP_CPU_AFFINITY env. variable
- Synchronization: OMP_WAIT_POLICY env. variable
- Intel Compiler: KMP_BLOCKTIME env. variable
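Putting some of these together, a job script might set (values purely illustrative; my_omp_app is a hypothetical binary):

```shell
export OMP_NUM_THREADS=4          # number of OpenMP threads
export OMP_STACKSIZE=64M          # per-thread stack size
export OMP_SCHEDULE="dynamic,16"  # picked up by schedule(runtime) loops
export OMP_PROC_BIND=close        # place threads close to the master thread
export OMP_WAIT_POLICY=active     # spin instead of sleep at barriers/locks

# Compiler-specific alternatives for thread affinity:
#   export KMP_AFFINITY=compact       # Intel
#   export GOMP_CPU_AFFINITY="0-3"    # GNU

# ./my_omp_app                    # then launch the application
```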
- Intel: Thread Affinity Interface
- OpenMPI run-time tuning