Noctua FPGA Usage, Integration and Development
Table of Contents
- 1 Updates
- 2 Overview
- 3 Applications with Integrated FPGA Acceleration
- 4 FPGA System Access and BSP Selection
- 5 Setting Up the FPGA Software Environment
- 6 Tool Info and Board Name
- 7 Sanity check: Actual Hardware on Noctua FPGA Node
- 8 Emulation of FPGA OpenCL kernels
- 9 Host code setup for emulation and hardware execution
- 10 Hardware Builds (full synthesis) of FPGA OpenCL kernels
- 11 Local SSDs
- 12 Serial Channel Point to Point Connections between Boards
- 13 Troubleshooting
Updates
2021
- 7 Jan
- Installed new Intel FPGA SDK for OpenCL 20.4.0
- Added reference to Intel design examples
2020
- 4 Dec
- Updated BSP name from legacy nalla_pcie to bittware_520n. On Noctua, both modules are now available to keep existing scripts working.
- Small updates to documentation
- Small updates to BSP module files
- 29 Oct
- Installed new Intel FPGA SDK for OpenCL 20.3.0
- Small updates to module files to reflect the changed structure of binaries inside the SDK
- 26 Jul
- Installed new Intel FPGA SDK for OpenCL 20.2.0
- 17. Apr
- Added overview of typical hardware build times and memory usage for job allocation.
- 16. Apr
- Installed new Intel FPGA SDK for OpenCL 20.1.0
- New documentation of the compatibility matrix of BSP and SDK versions.
- 24. Mar
- New configuration of synthesis queue. Refer to the workload manager documentation for details.
- Installed new 19.4.0 Bittware BSP.
- 06. Mar
- Extended description on host code linkage and platform selection for hardware execution and emulation.
- 08. Jan
- Fast emulation for Intel FPGA SDK for OpenCL 19.3.0 and newer is now also supported on frontends, compute and synthesis nodes.
2019
- 18. Dec
- Installed new Intel FPGA SDK for OpenCL 19.4.0
- Provided script to access custom topology information also when using salloc.
- Created overview section and updated some details of the documentation.
- 16. Dec
- Start of update log for FPGA infrastructure and documentation.
- Updated structure of module files for Bittware BSPs and Intel FPGA SDKs: export an environment variable FPGA_BOARD_NAME that you can use in your build system.
- Updated documentation of emulation: fast emulation is now supported on fpga-nodes starting with the 19.3.0 Intel FPGA SDK.
Overview
The Noctua FPGA partition is equipped with Bittware 520N cards with Intel Stratix 10 GX 2800 FPGAs. To execute or develop FPGA designs for this partition, two components are required, both of which are available in different versions:
- A Bittware board support package (BSP) including an FPGA base bitstream and drivers.
- The Intel FPGA SDK for OpenCL with further documentation by Intel. Relevant documentation includes in particular
- Intel FPGA SDK for OpenCL Programming Guide
- Intel FPGA SDK for OpenCL Best Practices Guide
For hardware execution, fpga-nodes with the matching BSP version need to be allocated, see Section FPGA System Access. The software environment is set up using modules, see Section Setting Up the FPGA Software Environment. During the early operations phase, Bittware BSP versions and Intel FPGA SDK versions were deployed in lockstep, but now BSPs can also be used with newer SDK versions.
Applications with Integrated FPGA Acceleration
Application | Type of Support |
---|---|
CP2K for DFT with FPGA Acceleration of the Submatrix Method | Ready-to-use module files and bitstreams deployed on Noctua |
CP2K for DFT with FPGA Acceleration of 3D FFTs | FPGA support in CP2K main repository + extra repository with FPGA designs fitting the Bittware 520N cards in Noctua |
HPCC FPGA: HPC Challenge Benchmark Suite for FPGAs | Repository with benchmark suite targeting FPGAs from Intel (including the Bittware 520N with Stratix 10 cards in Noctua) and Xilinx |
Cannon Matrix Multiplication on FPGAs | Repository with implementation of Cannon matrix multiplication as building block for GEMM on FPGAs fitting the Bittware 520N cards in Noctua |
Intel Design Examples
The latest Intel examples are shipped with the Intel FPGA SDK for OpenCL. On Noctua:
[user@fpga-0017 examples_aoc]$ module load intelFPGA_pro
[user@fpga-0017 examples_aoc]$ cd $INTELFPGAOCLSDKROOT/examples_aoc
[user@fpga-0017 examples_aoc]$ ls
asian_option  compression       extlibs  fft1d_offchip  jpeg_decoder      library_hls_sot      loopback_hostpipe  matrix_mult                   optical_flow  vector_add
channelizer   compute_score     fd3d     fft2d          library_example1  library_matrix_mult  Makefile           multithread_vector_operation  sobel_filter  video_downscaling
common        double_buffering  fft1d    hello_world    library_example2  local_memory_cache   mandelbrot         n_way_buffering               tdfir         web
You can copy any interesting example to your working directory.
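For example, to copy the vector_add example and compile it for emulation (a sketch, assuming the usual device/vector_add.cl layout of the examples; the target directory is a placeholder):

cp -r $INTELFPGAOCLSDKROOT/examples_aoc/vector_add ~/vector_add
cd ~/vector_add
aoc -march=emulator device/vector_add.cl -o bin/vector_add.aocx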
FPGA System Access and BSP Selection
To use Noctua nodes with FPGAs, you need to select the partition with your Slurm command and provide a constraint specifying the version of the Bittware board support package (BSP) that your designs have been built for, e.g.
srun --partition=fpga --constraint=19.4.0_max
Constraints can be used together with srun, sbatch and salloc; however, under some conditions salloc will fail, for details refer to the end of this section. We recommend always using srun or sbatch; a minimal job script sketch follows.
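A minimal sbatch job script could look like the following sketch (BSP version, time limit and host binary name are placeholders, adjust them to your setup):

#!/bin/bash
#SBATCH --partition=fpga
#SBATCH --constraint=19.4.0_max
#SBATCH --time=00:30:00
# load the BSP module matching the allocated constraint, plus the SDK
module load bittware_520n/19.4.0_max intelFPGA_pro
./bin/host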
The list of supported board support package (BSP) versions is:
19.4.0_max 19.4.0_hpc 19.2.0_max 19.2.0_hpc 19.1.0_max 18.1.1 18.1.1_hpc 18.0.1 18.0.0
A list of matching versions of the Intel(R) FPGA SDK for OpenCL(TM) can be found in Section Matching SDK versions and BSP versions.
For hardware execution on the FPGA nodes, you must always specify the job constraint that fits the BSP version you used to synthesize the bitstream (.aocx file). For details refer to the next subsection. If you see a message like the one below, you have used the wrong constraint, which can impact your own results or make the machine unavailable for subsequent jobs.
MMD INFO : [aclbitt_s10_pcie1] Quartus versions for base and import compile do not match
MMD INFO : [aclbitt_s10_pcie1] Board is currently programmed with sof from Quartus 19.4.0 64
MMD INFO : [aclbitt_s10_pcie1] PR import was compiled with Quartus 19.2.0 57
When designing and creating new FPGA bitstreams, we recommend always using the latest version. Along with stability improvements, the 19.2.0 BSPs also allow for up to ~100 MHz higher clock frequencies. BSP versions prior to 19.2.0 may become deprecated in the future. Use older BSP versions only to reuse existing bitstreams synthesized earlier.
For the 19.2.0 and 18.1.1 tools, there are two versions of the BSP that target the same board. The 19.2.0_max BSP enables the external serial channels and thus offers the default functionality for our setup. The 19.2.0_hpc BSP does not offer external serial channels, but may enable higher clock frequencies for this tool version.
Identifying BSP versions of existing bitstreams
You can find out the constraint required for an existing bitstream using two commands of the aocl binedit tool, e.g.
module load intelFPGA_pro
aocl binedit build/19_3/krn_auto/volume_dummy_v10.aocx print .acl.board
aocl binedit build/19_3/krn_auto/volume_dummy_v10.aocx print .acl.board_package
Matching the output values of aocl binedit against the first two columns allows you to identify the constraint to be used.
.acl.board | .acl.board_package | --constraint |
---|---|---|
p520_max_sg280l | /cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10 | 19.4.0_max |
/cm/shared/opt/intelFPGA_pro/19.2.0/hld/board/bittware_pcie/s10 | 19.2.0_max | |
/cm/shared/opt/intelFPGA_pro/19.1/hld/board/bittware_pcie/s10 | 19.1.0 | |
/opt/intelFPGA_pro/18.1.1/hld/board/nalla_pcie | 18.1.1_max | |
/opt/intelFPGA_pro/18.0.1/hld/board/nalla_pcie | 18.0.1 | |
/opt/intelFPGA_pro/18.0.0/hld/board/nalla_pcie | 18.0.0 | |
p520_hpc_sg280l | /cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10 | 19.4.0_hpc |
/cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10_hpc_default | 19.4.0_hpc | |
/cm/shared/opt/intelFPGA_pro/19.2.0/hld/board/bittware_pcie/s10 | 19.2.0_hpc | |
/cm/shared/opt/intelFPGA_pro/19.2.0/hld/board/bittware_pcie/s10_hpc_default | 19.2.0_hpc | |
/opt/intelFPGA_pro/18.1.1/hld/board/nalla_pcie | 18.1.1_hpc | |
p520_max_sg280h | (any) | unsupported |
p520_hpc_sg280h | (any) | unsupported |
Unsupported BSPs: Since the 19.1.0 tools, the BSP comes with support for two different target boards: p520_max_sg280l and p520_max_sg280h. These differentiate between boards with so-called L-Tile and H-Tile FPGAs. Our boards contain L-Tile FPGAs, so only use p520_max_sg280l (and p520_hpc_sg280l for selected BSP versions) as targets for synthesis.
Issues with salloc
Constraints can be specified together with srun, sbatch and salloc. However, salloc only works well when the constraints are already satisfied by available nodes. We recommend always using srun or sbatch.
A problem occurs when one of the nodes to be allocated is configured for a different constraint and is currently in use. Then salloc will fail with the following error message.
salloc: error: Job submit/allocate failed: Requested node configuration is not available
Workaround: use an allocation without requesting specific node names.
A problem also occurs when one of the nodes to be allocated is configured for a different constraint and is currently free. The allocation succeeds while the nodes are still being reconfigured; programs or scripts starting during this time will fail, with varying error messages.
Setting Up the FPGA Software Environment
Designing, emulating and executing FPGA designs requires a common software environment. The Bittware (formerly Nallatech) BSP (including drivers) and the Intel(R) FPGA SDK for OpenCL(TM) are set up using modules with the respective commands:
module load bittware_520n
module load intelFPGA_pro
For any actual FPGA usage, the version of the BSP module (bittware_520n) must match the BSP version allocated via the node constraint. By default, the latest supported version is loaded. To use a specific version, you can append the version number like this:
module load bittware_520n/19.4.0_hpc intelFPGA_pro/20.3.0
Matching SDK versions and BSP versions
Bittware BSP modules (bittware_520n) can always be used with Intel(R) FPGA SDK for OpenCL(TM) modules (intelFPGA_pro) with the same or newer version numbers, up to one major version update. For hardware execution, the bittware_520n BSP must always match the allocated constraint. The following combinations of bittware_520n BSPs and intelFPGA_pro SDKs are supported:
intelFPGA_pro modules | bittware_520n modules (make sure to match the allocated constraint) | | | | | |
---|---|---|---|---|---|---|
 | 19.4.0_max/_hpc | 19.2.0_max/_hpc | 19.1.0 | 18.1.1_max/_hpc | 18.0.1 | 18.0.0 |
20.4.0 | yes, recommended | yes | yes | | | |
20.3.0 | yes | yes | yes | | | |
20.2.0 | yes | yes | yes | | | |
20.1.0 | yes | yes | yes | | | |
19.4.0(_max/_hpc) | yes | yes | yes | yes | yes | yes |
19.3.0 | | yes | yes | yes | yes | yes |
19.2.0(_max/_hpc) | | yes | yes | yes | yes | yes |
19.1.0 | | | yes | yes | yes | yes |
18.1.1(_max/_hpc) | | | | yes | yes | yes |
18.0.1 | | | | | yes | yes |
18.0.0 | | | | | | yes |
When designing and creating new FPGA bitstreams, we recommend always using the latest version. Along with stability improvements, the 19.4.0 and 19.2.0 BSPs also allow for up to ~100 MHz higher clock frequencies than previous versions. BSP versions prior to 19.4.0 may become deprecated in the future. Use older BSP versions only to reuse existing bitstreams synthesized earlier.
Required gcc versions
When you link to the Intel FPGA OpenCL runtime, as well as for building most current examples, you have to use a newer compiler than the one loaded by default (gcc 4.8). The following modules have been tested successfully:
module load compiler/GCCcore/8.2.0
module load compiler/GCCcore/7.3.0
module load compiler/GCCcore/6.4.0
module load compiler/GCCcore/5.4.0
module load gcc/7.2.0
module load gcc/6.1.0
Since the 19.3 release of the Intel FPGA SDK, Intel documents the underlying requirement: a sufficiently recent libstdc++ must be found in the LD_LIBRARY_PATH, which is fulfilled by the modules listed above.
Tool Info and Board Name
You can confirm the tool version in use and find out the name of the target platform that you need for building hardware designs like this:
[tester@fe-1 ~]$ aoc -version
Intel(R) FPGA SDK for OpenCL(TM), 64-Bit Offline Compiler
Version 19.1.0 Build 240 Pro Edition
Copyright (C) 2019 Intel Corporation
[tester@fe-1 ~]$ aoc -list-boards
Board list:
  p520_max_sg280h
     Board Package: /cm/shared/opt/intelFPGA_pro/19.1/hld/board/bittware_pcie/s10
     Channels: kernel_input_ch0, kernel_output_ch0, kernel_input_ch1, kernel_output_ch1, kernel_input_ch2, kernel_output_ch2, kernel_input_ch3, kernel_output_ch3
  p520_max_sg280l
     Board Package: /cm/shared/opt/intelFPGA_pro/19.1/hld/board/bittware_pcie/s10
     Channels: kernel_input_ch0, kernel_output_ch0, kernel_input_ch1, kernel_output_ch1, kernel_input_ch2, kernel_output_ch2, kernel_input_ch3, kernel_output_ch3
or
[tester@fe-1 matrix_mult]$ aoc -version
Intel(R) FPGA SDK for OpenCL(TM), 64-Bit Offline Compiler
Version 18.1.1 Build 263 Pro Edition
Copyright (C) 2018 Intel Corporation
[tester@fe-1 matrix_mult]$ aoc -list-boards
Board list:
  p520_hpc_sg280l
     Board Package: /opt/intelFPGA_pro/18.1.1/hld/board/nalla_pcie
  p520_max_sg280l
     Board Package: /opt/intelFPGA_pro/18.1.1/hld/board/nalla_pcie
     Channels: kernel_input_ch0, kernel_output_ch0, kernel_input_ch1, kernel_output_ch1, kernel_input_ch2, kernel_output_ch2, kernel_input_ch3, kernel_output_ch3
The tools are available on all nodes and frontends.
Sanity check: Actual Hardware on Noctua FPGA Node
You can perform a status and sanity check of the actual FPGA hardware as follows. This test works only on the fpga-nodes (fpga-0001 to fpga-0016), not on the frontends or compute nodes.
[tester@fpga-0006 ~]$ aocl diagnose
--------------------------------------------------------------------
ICD System Diagnostics
--------------------------------------------------------------------

Using the following location for ICD installation: /etc/OpenCL/vendors
Found 4 icd entry at that location:
/etc/OpenCL/vendors/intel-cpu.icd
/etc/OpenCL/vendors/intel-neo.icd
/etc/OpenCL/vendors/Intel_FPGA_SSG_Emulator.icd
/etc/OpenCL/vendors/Altera.icd
The following OpenCL libraries are referenced in the icd files:
libintelocl.so
libigdrcl.so
libintelocl_emu.so
libalteracl.so
Checking LD_LIBRARY_PATH for registered libraries:
libalteracl.so was registered on the system at /cm/shared/opt/intelFPGA_pro/20.3.0/hld/host/linux64/lib
Using the following location for fcd installations: /opt/Intel/OpenCL/Boards
Found 1 fcd entry at that location:
/opt/Intel/OpenCL/Boards/bitt_s10_pcie.fcd
The following OpenCL libraries are referenced in the fcd files:
/cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10/linux64/lib/libbitt_s10_pcie_mmd.so
Checking LD_LIBRARY_PATH for registered libraries:
/cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10/linux64/lib/libbitt_s10_pcie_mmd.so was registered on the system.

Number of Platforms = 2
1. Intel(R) FPGA Emulation Platform for OpenCL(TM) | Intel(R) Corporation | OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
2. Intel(R) FPGA SDK for OpenCL(TM)                | Intel(R) Corporation | OpenCL 1.0 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
--------------------------------------------------------------------
ICD diagnostics PASSED
--------------------------------------------------------------------
--------------------------------------------------------------------
BSP Diagnostics
--------------------------------------------------------------------

Device Name: acl0
BSP Install Location: /cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10
Vendor: BittWare ltd

MMD INFO : QSFP Module 0 : serial number : EO1190226  power class : 3
MMD INFO : QSFP Module 0 : Setting to high power mode
MMD INFO : QSFP Module 0 : qsfpid 0x0d : power class : 3 (power class register is 0x02)
MMD INFO : QSFP Module 0 : temperature : 48.32 degC
MMD INFO : QSFP Module 1 : serial number : EO1190226  power class : 3
MMD INFO : QSFP Module 1 : Setting to high power mode
MMD INFO : QSFP Module 1 : qsfpid 0x0d : power class : 3 (power class register is 0x02)
MMD INFO : QSFP Module 1 : temperature : 26.15 degC
MMD INFO : QSFP Module 2 : serial number : EO1190226  power class : 3
MMD INFO : QSFP Module 2 : Setting to high power mode
MMD INFO : QSFP Module 2 : qsfpid 0x0d : power class : 3 (power class register is 0x02)
MMD INFO : QSFP Module 2 : temperature : 38.14 degC
MMD INFO : QSFP Module 3 : serial number : EO1190226  power class : 3
MMD INFO : QSFP Module 3 : Setting to high power mode
MMD INFO : QSFP Module 3 : qsfpid 0x0d : power class : 3 (power class register is 0x02)
MMD INFO : QSFP Module 3 : temperature : 38.37 degC

Phys Dev Name      Status  Information
aclbitt_s10_pcie0  Passed  BittWare Stratix 10 OpenCL platform (aclbitt_s10_pcie0)
                           PCIe dev_id = 5170, sub_dev_id = 5204, bus:slot.func = 16:00.00, Gen3 x8
                           Card serial number = EO1190226.
                           FPGA temperature = 40 degrees C.
                           Total Card Power Usage = 67.1114 Watts.
                           Serial Channel 0 status reg = 0x0ff11ff1.
                           Serial Channel 1 status reg = 0x0ff11ff1.
                           Serial Channel 2 status reg = 0x0ff11ff1.
                           Serial Channel 3 status reg = 0x0ff11ff1.
                           Current PR ID reg = 0x3ccfa79a(1020241818).

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Device Name: acl1
BSP Install Location: /cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10
Vendor: BittWare ltd

MMD INFO : QSFP Module 0 : serial number : EO1190226  power class : 3
MMD INFO : QSFP Module 0 : Setting to high power mode
MMD INFO : QSFP Module 0 : qsfpid 0x0d : power class : 3 (power class register is 0x02)
MMD INFO : QSFP Module 0 : temperature : 38.60 degC
MMD INFO : QSFP Module 1 : serial number : EO1190226  power class : 3
MMD INFO : QSFP Module 1 : Setting to high power mode
MMD INFO : QSFP Module 1 : qsfpid 0x0d : power class : 3 (power class register is 0x02)
MMD INFO : QSFP Module 1 : temperature : 41.93 degC
MMD INFO : QSFP Module 2 : serial number : EO1190226  power class : 3
MMD INFO : QSFP Module 2 : Setting to high power mode
MMD INFO : QSFP Module 2 : qsfpid 0x0d : power class : 3 (power class register is 0x02)
MMD INFO : QSFP Module 2 : temperature : 40.40 degC
MMD INFO : QSFP Module 3 : serial number : EO1190226  power class : 3
MMD INFO : QSFP Module 3 : Setting to high power mode
MMD INFO : QSFP Module 3 : qsfpid 0x0d : power class : 3 (power class register is 0x02)
MMD INFO : QSFP Module 3 : temperature : 37.69 degC

Phys Dev Name      Status  Information
aclbitt_s10_pcie1  Passed  BittWare Stratix 10 OpenCL platform (aclbitt_s10_pcie1)
                           PCIe dev_id = 5170, sub_dev_id = 5204, bus:slot.func = af:00.00, Gen3 x8
                           Card serial number = EO1190226.
                           FPGA temperature = 38 degrees C.
                           Total Card Power Usage = 66.5158 Watts.
                           Serial Channel 0 status reg = 0x0ff11ff1.
                           Serial Channel 1 status reg = 0x0ff11ff1.
                           Serial Channel 2 status reg = 0x0ff11ff1.
                           Serial Channel 3 status reg = 0x0ff11ff1.
                           Current PR ID reg = 0x3ccfa79a(1020241818).

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices
Emulation of FPGA OpenCL kernels
Before synthesis and hardware execution, it is highly recommended to check the functionality of your OpenCL design in emulation. Between versions 19.1.0 and 20.2.0 of the Intel FPGA SDK for OpenCL, two emulation modes are implemented, with compiler flags to aoc controlling the mode (in all modes, -march=emulator is used to target emulation in the first place). For our Noctua setup, we recommend using fast emulation for Intel FPGA SDK for OpenCL 19.3.0 or newer. 19.2.0 is the only version where you need to explicitly use a compiler argument to get to the recommended mode: -legacy-emulator. An example compile command follows the table below.
Intel FPGA SDK for OpenCL Versions | Legacy Emulation | Fast Emulation | Recommended for Noctua |
---|---|---|---|
18.x.x | default | not available | Legacy Emulation |
19.1.0 | default | -fast-emulator | Legacy Emulation |
19.2.0 | -legacy-emulator | default | Legacy Emulation |
19.3.0 | -legacy-emulator | default | Fast Emulation |
19.4.0 | -legacy-emulator | default | Fast Emulation |
20.1.0 | -legacy-emulator | default | Fast Emulation |
20.2.0 | -legacy-emulator | default | Fast Emulation |
20.3.0 and later | no longer supported | default | Fast Emulation |
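For example, with SDK version 19.3.0 or newer, a kernel can be compiled for the recommended fast emulation like this (the kernel file name is a placeholder; for 19.2.0, add -legacy-emulator to get the recommended legacy mode instead):

module load intelFPGA_pro
aoc -march=emulator my_kernel.cl -o bin/my_kernel.aocx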
For setup of the host code for emulation, refer to the following section.
Host code setup for emulation and hardware execution
Linking your host code
For hardware execution, all link configurations obtained with aocl link-config are functional. For fast emulation, there are currently problems with the automatic link configuration on frontend and compute nodes.
Workaround 1
Build and test your host code on an fpga-node. You can use --constraint=emul if you don't need a specific BSP version for hardware tests.
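For example (the account name is a placeholder):

srun -p fpga -A YOUR_ACCOUNT --constraint=emul --pty bash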
Workaround 2
Manually set the right linkage options in your build process.
The configuration you want to get, for the respective tool versions, is:
[user@fpga-0005 ~]$ aocl link-config
-L/cm/shared/opt/intelFPGA_pro/19.3.0/hld/host/linux64/lib -lOpenCL
[user@fpga-0005 ~]$ aocl link-config
-L/cm/shared/opt/intelFPGA_pro/19.4.0/hld/host/linux64/lib -lOpenCL
If you instead see something like this, you need to manually change your build flow:
[user@ln-0002 ~]$ aocl link-config
-L/cm/shared/opt/intelFPGA_pro/19.2.0/hld/board/bittware_pcie/s10/linux64/lib -L/cm/shared/opt/intelFPGA_pro/19.3.0/hld/host/linux64/lib -Wl,--no-as-needed -lalteracl -lbitt_s10_pcie_mmd -lelf
[user@ln-0002 ~]$ aocl link-config
-L/cm/shared/opt/intelFPGA_pro/19.2.0/hld/board/bittware_pcie/s10/linux64/lib -L/cm/shared/opt/intelFPGA_pro/19.4.0/hld/host/linux64/lib -Wl,--no-as-needed -lalteracl -lbitt_s10_pcie_mmd -lelf
You can check the correct linkage of your host code with ldd, e.g. ldd bin/host and compare with the outputs in the next section.
Selecting the right platform name in the host code
Many examples of OpenCL host codes either select the first OpenCL platform they find, or use a fixed identifier string to identify their target platform. When using fast emulation, you need to adapt the platform name inside the host code accordingly; note that many host code examples convert the platform names to lower or upper case before matching. A sketch of such a platform selection follows the table below.
Execution mode | Platform name to match in host code | Required linker flag | Linking sanity check with ldd |
---|---|---|---|
Hardware execution | Intel(R) FPGA SDK for OpenCL(TM) | -lOpenCL or -lalteracl | any of the below versions |
Fast emulation | Intel(R) FPGA Emulation Platform for OpenCL(TM) | -lOpenCL | libOpenCL.so.1 => /cm/shared/opt/intelFPGA_pro/<QUARTUS_VERSION>/hld/host/linux64/lib/libOpenCL.so.1 (0x00002aaaaacba000) |
Legacy emulation | Intel(R) FPGA SDK for OpenCL(TM) | -lalteracl | libalteracl.so => /cm/shared/opt/intelFPGA_pro/<QUARTUS_VERSION>/hld/host/linux64/lib/libalteracl.so |
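As a minimal sketch (not a reference implementation), platform selection by substring match with the OpenCL C API could look as follows; the function name and the fixed-size array are illustrative choices:

#include <CL/cl.h>
#include <string.h>

/* Return the first platform whose name contains `wanted`, or NULL.
 * For hardware execution match "Intel(R) FPGA SDK for OpenCL(TM)",
 * for fast emulation match "Intel(R) FPGA Emulation Platform for OpenCL(TM)". */
static cl_platform_id find_platform(const char *wanted) {
    cl_platform_id ids[16];
    cl_uint num = 0;
    if (clGetPlatformIDs(16, ids, &num) != CL_SUCCESS)
        return NULL;
    for (cl_uint i = 0; i < num && i < 16; ++i) {
        char name[256] = {0};
        clGetPlatformInfo(ids[i], CL_PLATFORM_NAME, sizeof(name) - 1, name, NULL);
        if (strstr(name, wanted) != NULL)
            return ids[i];
    }
    return NULL;
}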
Hardware Builds (full synthesis) of FPGA OpenCL kernels
The full hardware builds for FPGA OpenCL kernels take at least several hours. You can use the fpgasyn, long or batch partitions to submit hardware build jobs. Only the fpgasyn partition allows sharing a node between multiple synthesis jobs, depending on your estimated main memory usage during synthesis. Below we provide rough guidelines on the hardware build times to expect. With regard to time limits, you typically want to provide generous headroom (e.g. 4x, up to the full 3 day limit of fpgasyn) in your job submission in order to avoid automatic cancellation of a job that could still have finished. For the memory limit, it can be more beneficial to provide estimates closer to the real utilization, in order to leave space for other jobs on the node.
Design properties | Estimated time | Estimated memory | Suggested arguments for fpgasyn partition |
---|---|---|---|
Low resource utilization (<10% in Kernel System), simple memory interface (global interconnect for <10 global loads + stores) | 2-4 h | 45 GB | --mem=45000MB --time=8:00:00 |
Medium resource utilization (<40% ALUTs and FFs, and <60% RAMs and DSPs in Kernel System), simple to medium memory interface (global interconnect for <20 global loads + stores) | 8-12 h | 60-90 GB | --mem=90000MB --time=24:00:00 |
High resource utilization (>50% ALUTs and FFs, or >70% RAMs and DSPs in Kernel System), simple to medium memory interface (global interconnect for <20 global loads + stores) | 12-20 h | 90-120 GB | --mem=120000MB --time=48:00:00 |
Any resource utilization, complex memory interface (global interconnect for >100 global loads + stores) | 30-60 h | 120+ GB | --exclusive --time=72:00:00 |
Moreover, the synthesis runs generate several GB of result files, so it is advisable to run kernel code synthesis in a directory on the $PC2PFS parallel file system for performance and capacity reasons.
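Putting this together, a synthesis job submission might look like the following sketch (project directory and kernel name are placeholders; FPGA_BOARD_NAME is exported by the loaded modules, see the update log above):

#!/bin/bash
#SBATCH --partition=fpgasyn
#SBATCH --mem=90000MB
#SBATCH --time=24:00:00
module load intelFPGA_pro bittware_520n
# run the synthesis in the parallel file system
cd $PC2PFS/YOUR_PROJECT/synthesis
aoc -board=$FPGA_BOARD_NAME my_kernel.cl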
For some tool versions (Intel FPGA SDK for OpenCL and/or Bittware BSP) there are specific optimization recommendations:
All current Bittware BSPs, starting with 18.1.1 (18.1.1_max and 18.1.1_hpc)
For most designs, we recommend using the additional kernel compilation options -global-ring -duplicate-ring. These change the external memory interconnect between the kernels and the base region to a structure that is optimized for Stratix 10 silicon. This has been seen to improve kernel fmax for BSPs with several banks of DDR memory, such as the 520N. Since the 19.1 Intel FPGA SDK for OpenCL software release, Intel has documented these settings in the AOCL Programming Guide and AOCL Best Practices Guide.
For example, to compile the vector addition example:
aoc -board=p520_max_sg280l vector_add.cl -global-ring -duplicate-ring
18.1.1 Intel FPGA SDK for OpenCL Known Issue
Under some circumstances, the 18.1.1 SDK can generate designs that contain severe timing violations. Don't use the generated bitstreams.
Other tool versions will only generate bitstreams with minor timing violations that are generally ok to execute in hardware. For severe timing violations, they don't generate a bitstream, but rather report routing failures.
Warnings indicating this in the aoc console output look as follows:
Warning: hold time violation of -0.232 ns on clock: board_inst|kernel_clk_gen|kernel_clk_gen|kernel_pll_outclk0
Corresponding outputs should show up in quartus_sh_compile.log and top.failing_clocks.rpt as
Info: Clock domains failing timing
+--------+---------------+-----------------------------------------------------------------+-----------------------+-----------------+
; Slack  ; End Point TNS ; Clock                                                           ; Operating conditions  ; Timing analysis ;
+--------+---------------+-----------------------------------------------------------------+-----------------------+-----------------+
; -0.429 ; -0.835        ; board_inst|mem_bank1|mem_emif_bank1|mem_emif_bank1_core_usr_clk ; Slow 900mV 100C Model ; Setup           ;
; -0.347 ; -144.745      ; board_inst|kernel_clk_gen|kernel_clk_gen|kernel_pll_outclk0     ; Slow 900mV 0C Model   ; Hold            ;
18.0.1 Intel FPGA SDK for OpenCL
All compilations should use non-interleaving for a significant improvement in the kernel clock frequency. This is a limitation of the Intel 18.0.1 OpenCL compiler that is resolved in later compiler releases.
For example, to compile the vector addition example:
aoc -board=p520_max_sg280l -no-interleaving=default vector_add.cl
Local SSDs
The FPGA nodes contain a local SSD mounted as /tmp that can be used as an additional scratch directory for performance-critical local I/O, e.g. for FPGA logging data.
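For example, a job could write performance-critical output to the local SSD first and copy it to the parallel file system at the end (a sketch; the --log-dir flag stands in for whatever output option your host application provides):

mkdir -p /tmp/$USER/run1
./bin/host --log-dir=/tmp/$USER/run1   # hypothetical output flag of your application
cp -r /tmp/$USER/run1 $PC2PFS/YOUR_PROJECT/results/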
Serial Channel Point to Point Connections between Boards
When configured with a p520_max_sg280l BSP, all FPGA boards offer 4 point-to-point connections to other FPGA boards. From the OpenCL environment, these are used as external serial channels. A status reg value of 0x0ff11ff1 in the diagnose output indicates an active connection. The topologies of these connections are fully configurable with each job allocation. There are a number of predefined topologies that can be selected with a shorthand notation like --fpgalink="pair", or you can provide a series of individual connection descriptions like --fpgalink="n00:acl0:ch0-n01:acl0:ch0".
Custom topologies
The notation nXX:aclY:chZ describes a unique serial channel endpoint within a job allocation according to the following pattern
- nXX, e.g. n02, specifies the node ID within your allocation, starting with n00 for the first node; n02 specifies the third node of your allocation. You cannot use higher node IDs than the number of nodes requested by the allocation. At allocation time, the node ID is translated to a concrete node name, e.g. fpga-0008.
- aclY, i.e. acl0 and acl1, describes the first and second FPGA board within each node.
- chZ, i.e. ch0, ch1, ch2 and ch3, describes the 4 external channel connections of each board.
By specifying one unique pair of serial channel endpoints per --fpgalink argument, an arbitrary topology can be created within a job allocation. When the task starts, the topology will be summarized and for each fpgalink, an environment variable will be exported. E.g.
srun -A pc2-mitarbeiter --constraint=19.2.0_max -N 1 --fpgalink="n00:acl0:ch0-n00:acl1:ch0" --fpgalink="n00:acl0:ch1-n00:acl1:ch1" --fpgalink="n00:acl0:ch2-n00:acl1:ch2" --fpgalink="n00:acl0:ch3-n00:acl1:ch3" -p fpga --pty bash
...
Summarizing most recent topology information and exporting FPGALINK variables:
Host list
fpga-0004
Generated connections
FPGALINK0=fpga-0004:acl0:ch0-fpga-0004:acl1:ch0
FPGALINK1=fpga-0004:acl0:ch1-fpga-0004:acl1:ch1
FPGALINK2=fpga-0004:acl0:ch2-fpga-0004:acl1:ch2
FPGALINK3=fpga-0004:acl0:ch3-fpga-0004:acl1:ch3
We recommend using srun and sbatch, because this information is not automatically shown when using salloc (the configuration itself still works). When using salloc, you can still recover the information and set up your environment variables by invoking
source /opt/cray/slurm/default/etc/scripts/SAllocTopologyInfo.sh
Predefined topologies
As it can be tedious and error-prone to define each connection manually, we also provide a set of predefined topologies to be requested. The following table summarizes the available options.
Topology type | Invocation | Min-max number of nodes | Brief description |
---|---|---|---|
pair | --fpgalink="pair" | 1-N | Pairwise connect the 2 FPGAs within each node |
clique | --fpgalink="clique" | 2 | All-to-all connection for 2 nodes, 4 FPGAs |
ring | --fpgalink="ringO" | 1-N | Ring with two links per direction, acl0 down, acl1 up |
--fpgalink="ringN" | 1-N | Ring with two links per direction, acl0 down, acl1 down | |
--fpgalink="ringZ" | 1-N | Ring with two links per direction, acl0 and acl1 neighbors | |
torus | --fpgalink="torus2" | 1-N | Torus with 2 FPGAs per row |
--fpgalink="torus3" | 2-N | Torus with 3 FPGAs per row | |
--fpgalink="torus4" | 2-N | Torus with 4 FPGAs per row | |
--fpgalink="torus5" | 3-N | Torus with 5 FPGAs per row | |
--fpgalink="torus6" | 3-N | Torus with 6 FPGAs per row |
Pair topology
Within each node, all channels of one FPGA board are connected to the respective channel of the other FPGA board. No connections between nodes are made.
srun -p fpga -A pc2-mitarbeiter --constraint=19.2.0_max -N 3 --fpgalink=pair --pty bash
...
Summarizing most recent topology information and exporting FPGALINK variables:
Host list
fpga-0001 fpga-0002 fpga-0003
Pair topology
Generated connections
FPGALINK0=fpga-0001:acl0:ch0-fpga-0001:acl1:ch0
FPGALINK1=fpga-0001:acl0:ch1-fpga-0001:acl1:ch1
FPGALINK2=fpga-0001:acl0:ch2-fpga-0001:acl1:ch2
FPGALINK3=fpga-0001:acl0:ch3-fpga-0001:acl1:ch3
FPGALINK4=fpga-0002:acl0:ch0-fpga-0002:acl1:ch0
FPGALINK5=fpga-0002:acl0:ch1-fpga-0002:acl1:ch1
FPGALINK6=fpga-0002:acl0:ch2-fpga-0002:acl1:ch2
FPGALINK7=fpga-0002:acl0:ch3-fpga-0002:acl1:ch3
FPGALINK8=fpga-0003:acl0:ch0-fpga-0003:acl1:ch0
FPGALINK9=fpga-0003:acl0:ch1-fpga-0003:acl1:ch1
FPGALINK10=fpga-0003:acl0:ch2-fpga-0003:acl1:ch2
FPGALINK11=fpga-0003:acl0:ch3-fpga-0003:acl1:ch3
Topology configuration request accepted after 0.297791957855s
Clique topology
Within a pair of 2 nodes, each of the 4 FPGAs is connected to all 3 other FPGAs. Channel 0: to the same FPGA in the other node; channel 1: to the other FPGA in the same node; channels 2 and 3: to the other FPGA in the other node.
srun -p fpga -A pc2-mitarbeiter --constraint=19.2.0_max -N 2 --fpgalink=clique --pty bash
...
Summarizing most recent topology information and exporting FPGALINK variables:
Host list
fpga-0013 fpga-0014
Clique topology
Generated connections
FPGALINK0=fpga-0013:acl0:ch0-fpga-0014:acl0:ch0
FPGALINK1=fpga-0013:acl1:ch0-fpga-0014:acl1:ch0
FPGALINK2=fpga-0013:acl0:ch1-fpga-0013:acl1:ch1
FPGALINK3=fpga-0014:acl0:ch1-fpga-0014:acl1:ch1
FPGALINK4=fpga-0013:acl0:ch2-fpga-0014:acl1:ch2
FPGALINK5=fpga-0013:acl1:ch2-fpga-0014:acl0:ch2
FPGALINK6=fpga-0013:acl0:ch3-fpga-0014:acl1:ch3
FPGALINK7=fpga-0013:acl1:ch3-fpga-0014:acl0:ch3
Ring topology
This setup puts all FPGAs in a ring topology that defines for each FPGA the neighbor FPGAs "north" and "south". It connects each FPGA's channels 0 and 2 to the "north" direction and channels 1 and 3 to the "south" direction. Thus, the local perspective for each node within the topology is
// local view from FPGA "local" to neighbors "north" and "south"
// ch0 and ch2 connect to neighbor "north"
local:ch0 <-> north:ch1
local:ch2 <-> north:ch3
// ch1 and ch3 connect to neighbor "south"
local:ch1 <-> south:ch0
local:ch3 <-> south:ch2
Three different variants define how the FPGAs are arranged into the ring:
// --fpgalink="ringO" // ringO, going down in acl0 column and back up in acl1 column // Column from north to south, end connected back to start fpga-0001:acl0 fpga-0002:acl0 fpga-0003:acl0 fpga-0004:acl0 fpga-0004:acl1 fpga-0003:acl1 fpga-0002:acl1 fpga-0001:acl1 // --fpgalink="ringN" // ringN, going down in acl0 column then down in acl1 column // Column from north to south, end connected back to start fpga-0001:acl0 fpga-0002:acl0 fpga-0003:acl0 fpga-0004:acl0 fpga-0001:acl1 fpga-0002:acl1 fpga-0003:acl1 fpga-0004:acl1 // --fpgalink="ringZ" // ringZ, going down through nodes, zigzaging between acl0 and acl1 // Column from north to south, end connected back to start fpga-0001:acl0 fpga-0001:acl1 fpga-0002:acl0 fpga-0002:acl1 fpga-0003:acl0 fpga-0003:acl1 fpga-0004:acl0 fpga-0004:acl1
Full example for a ringO with 4 nodes
srun -p fpga -A pc2-mitarbeiter --constraint=19.2.0_max -N 4 --fpgalink=ringO --pty bash
Summarizing most recent topology information and exporting FPGALINK variables:
Host list
fpga-0009 fpga-0010 fpga-0011 fpga-0012
Ring topology information: column from north to south, end connected back to start
fpga-0009:acl0
fpga-0010:acl0
fpga-0011:acl0
fpga-0012:acl0
fpga-0012:acl1
fpga-0011:acl1
fpga-0010:acl1
fpga-0009:acl1
Generated connections
FPGALINK0=fpga-0009:acl0:ch1-fpga-0010:acl0:ch0
FPGALINK1=fpga-0009:acl0:ch3-fpga-0010:acl0:ch2
FPGALINK2=fpga-0010:acl0:ch1-fpga-0011:acl0:ch0
FPGALINK3=fpga-0010:acl0:ch3-fpga-0011:acl0:ch2
FPGALINK4=fpga-0011:acl0:ch1-fpga-0012:acl0:ch0
FPGALINK5=fpga-0011:acl0:ch3-fpga-0012:acl0:ch2
FPGALINK6=fpga-0012:acl0:ch1-fpga-0012:acl1:ch0
FPGALINK7=fpga-0012:acl0:ch3-fpga-0012:acl1:ch2
FPGALINK8=fpga-0012:acl1:ch1-fpga-0011:acl1:ch0
FPGALINK9=fpga-0012:acl1:ch3-fpga-0011:acl1:ch2
FPGALINK10=fpga-0011:acl1:ch1-fpga-0010:acl1:ch0
FPGALINK11=fpga-0011:acl1:ch3-fpga-0010:acl1:ch2
FPGALINK12=fpga-0010:acl1:ch1-fpga-0009:acl1:ch0
FPGALINK13=fpga-0010:acl1:ch3-fpga-0009:acl1:ch2
FPGALINK14=fpga-0009:acl1:ch1-fpga-0009:acl0:ch0
FPGALINK15=fpga-0009:acl1:ch3-fpga-0009:acl0:ch2
Torus topology
This setup puts all FPGAs in a torus topology that defines for each FPGA the neighbor FPGAs "north", "south", "west", "east". It connects each FPGA's channel 0 to the "north" direction, channel 1 to the "south" direction, channel 2 to the "west" direction and channel 3 to the "east" direction. Thus, the local perspective for each node within the topology is
// local view from FPGA "local" to neighbors "north", "south", "west", "east"
// ch0 connects to neighbor "north"
local:ch0 <-> north:ch1
// ch1 connects to neighbor "south"
local:ch1 <-> south:ch0
// ch2 connects to neighbor "west"
local:ch2 <-> west:ch3
// ch3 connects to neighbor "east"
local:ch3 <-> east:ch2
The torus topology can be instantiated with a configurable width, that is, the number of FPGAs connected in "west-east" direction. With an odd width, FPGAs in the same node can belong to consecutive rows of the torus. The number of FPGAs gets rounded down to the biggest full torus for the given width. The following block illustrates 3 different torus topologies on nodes fpga-[0001-0005].
// --fpgalink="torus2" // Torus with width 2 and height 5 // Columns from north to south, rows from west to east, end connected back to start fpga-0001:acl0 - fpga-0001:acl1 fpga-0002:acl0 - fpga-0002:acl1 fpga-0003:acl0 - fpga-0003:acl1 fpga-0004:acl0 - fpga-0004:acl1 fpga-0005:acl0 - fpga-0005:acl1 // --fpgalink="torus3" // Torus with width 3 and height 3 // Columns from north to south, rows from west to east, end connected back to start fpga-0001:acl0 - fpga-0001:acl1 - fpga-0002:acl0 fpga-0002:acl1 - fpga-0003:acl0 - fpga-0003:acl1 fpga-0004:acl0 - fpga-0004:acl1 - fpga-0005:acl0 // --fpgalink="torus4" // Torus with width 4 and height 2 // Columns from north to south, rows from west to east, end connected back to start fpga-0001:acl0 - fpga-0001:acl1 - fpga-0002:acl0 - fpga-0002:acl1 fpga-0003:acl0 - fpga-0003:acl1 - fpga-0004:acl0 - fpga-0004:acl1
Full example for a torus4 with 8 nodes
srun -p fpga -A pc2-mitarbeiter --constraint=19.2.0_max -N 8 --fpgalink=torus4 --pty bash
...
Summarizing most recent topology information and exporting FPGALINK variables:
Host list
fpga-0001 fpga-0002 fpga-0003 fpga-0004 fpga-0005 fpga-0006 fpga-0007 fpga-0008
Torus topology with width 4 and height 4
Torus topology information: columns from north to south, rows from west to east, end connected back to start
fpga-0001:acl0 - fpga-0001:acl1 - fpga-0002:acl0 - fpga-0002:acl1
fpga-0003:acl0 - fpga-0003:acl1 - fpga-0004:acl0 - fpga-0004:acl1
fpga-0005:acl0 - fpga-0005:acl1 - fpga-0006:acl0 - fpga-0006:acl1
fpga-0007:acl0 - fpga-0007:acl1 - fpga-0008:acl0 - fpga-0008:acl1
Generated connections
FPGALINK0=fpga-0001:acl0:ch1-fpga-0003:acl0:ch0
FPGALINK1=fpga-0001:acl0:ch3-fpga-0001:acl1:ch2
FPGALINK2=fpga-0001:acl1:ch1-fpga-0003:acl1:ch0
FPGALINK3=fpga-0001:acl1:ch3-fpga-0002:acl0:ch2
FPGALINK4=fpga-0002:acl0:ch1-fpga-0004:acl0:ch0
FPGALINK5=fpga-0002:acl0:ch3-fpga-0002:acl1:ch2
FPGALINK6=fpga-0002:acl1:ch1-fpga-0004:acl1:ch0
FPGALINK7=fpga-0002:acl1:ch3-fpga-0001:acl0:ch2
FPGALINK8=fpga-0003:acl0:ch1-fpga-0005:acl0:ch0
FPGALINK9=fpga-0003:acl0:ch3-fpga-0003:acl1:ch2
FPGALINK10=fpga-0003:acl1:ch1-fpga-0005:acl1:ch0
FPGALINK11=fpga-0003:acl1:ch3-fpga-0004:acl0:ch2
FPGALINK12=fpga-0004:acl0:ch1-fpga-0006:acl0:ch0
FPGALINK13=fpga-0004:acl0:ch3-fpga-0004:acl1:ch2
FPGALINK14=fpga-0004:acl1:ch1-fpga-0006:acl1:ch0
FPGALINK15=fpga-0004:acl1:ch3-fpga-0003:acl0:ch2
FPGALINK16=fpga-0005:acl0:ch1-fpga-0007:acl0:ch0
FPGALINK17=fpga-0005:acl0:ch3-fpga-0005:acl1:ch2
FPGALINK18=fpga-0005:acl1:ch1-fpga-0007:acl1:ch0
FPGALINK19=fpga-0005:acl1:ch3-fpga-0006:acl0:ch2
FPGALINK20=fpga-0006:acl0:ch1-fpga-0008:acl0:ch0
FPGALINK21=fpga-0006:acl0:ch3-fpga-0006:acl1:ch2
FPGALINK22=fpga-0006:acl1:ch1-fpga-0008:acl1:ch0
FPGALINK23=fpga-0006:acl1:ch3-fpga-0005:acl0:ch2
FPGALINK24=fpga-0007:acl0:ch1-fpga-0001:acl0:ch0
FPGALINK25=fpga-0007:acl0:ch3-fpga-0007:acl1:ch2
FPGALINK26=fpga-0007:acl1:ch1-fpga-0001:acl1:ch0
FPGALINK27=fpga-0007:acl1:ch3-fpga-0008:acl0:ch2
FPGALINK28=fpga-0008:acl0:ch1-fpga-0002:acl0:ch0
FPGALINK29=fpga-0008:acl0:ch3-fpga-0008:acl1:ch2
FPGALINK30=fpga-0008:acl1:ch1-fpga-0002:acl1:ch0
FPGALINK31=fpga-0008:acl1:ch3-fpga-0007:acl0:ch2
Legacy topology setup
The following setup has been completely replaced by the user-configurable, job-specific topologies described above. It is kept here as a reference for earlier measurements.
Some FPGA boards were connected with direct point-to-point connections that are abstracted in the OpenCL environment as serial channels. A status reg value of 0x0ff11ff1 in the diagnose output indicates an active connection. The connections available at the time were documented as
<nodename>:<devicename>:<channelname>
with <nodename> in fpga-0001 -- fpga-0016
     <devicename> in acl0, acl1
     <channelname> in ch0, ch1, ch2, ch3
Islands with 4 FPGAs
fpga-0010 + fpga-0011
// four FPGAs with all-to-all connections
// ch0 realizes vertical connections
fpga-0010:acl0:ch0 <-> fpga-0011:acl0:ch0
fpga-0010:acl1:ch0 <-> fpga-0011:acl1:ch0
// ch1 realizes horizontal connections
fpga-0010:acl0:ch1 <-> fpga-0010:acl1:ch1
fpga-0011:acl0:ch1 <-> fpga-0011:acl1:ch1
// ch2 realizes diagonal connections
fpga-0010:acl0:ch2 <-> fpga-0011:acl1:ch2
fpga-0010:acl1:ch2 <-> fpga-0011:acl0:ch2
fpga-0015 + fpga-0016
// four FPGAs with all-to-all connections
// ch0 realizes vertical connections
fpga-0015:acl0:ch0 <-> fpga-0016:acl0:ch0
fpga-0015:acl1:ch0 <-> fpga-0016:acl1:ch0
// ch1 realizes horizontal connections
fpga-0015:acl0:ch1 <-> fpga-0015:acl1:ch1
fpga-0016:acl0:ch1 <-> fpga-0016:acl1:ch1
// ch2 realizes diagonal connections
fpga-0015:acl0:ch2 <-> fpga-0016:acl1:ch2
fpga-0015:acl1:ch2 <-> fpga-0016:acl0:ch2
Nodes with internal connections of 2 FPGAs
fpga-0012
// four connections from one board to the other
fpga-0012:acl0:ch0 <-> fpga-0012:acl1:ch0
fpga-0012:acl0:ch1 <-> fpga-0012:acl1:ch1
fpga-0012:acl0:ch2 <-> fpga-0012:acl1:ch2
fpga-0012:acl0:ch3 <-> fpga-0012:acl1:ch3
fpga-0013
// four connections from one board to the other
fpga-0013:acl0:ch0 <-> fpga-0013:acl1:ch0
fpga-0013:acl0:ch1 <-> fpga-0013:acl1:ch1
fpga-0013:acl0:ch2 <-> fpga-0013:acl1:ch2
fpga-0013:acl0:ch3 <-> fpga-0013:acl1:ch3
fpga-0014
// two connections from one board to the other
fpga-0014:acl0:ch0 <-> fpga-0014:acl1:ch0
fpga-0014:acl0:ch3 <-> fpga-0014:acl1:ch3
A torus connecting 5 nodes and 10 FPGAs
14 August 2019: Torus updated from 4 to 5 nodes
The topology forms 2 columns that span all nodes and 5 rows that connect the 2 FPGAs within each node.
fpga-0005:acl0 - fpga-0005:acl1
fpga-0006:acl0 - fpga-0006:acl1
fpga-0007:acl0 - fpga-0007:acl1
fpga-0008:acl0 - fpga-0008:acl1
fpga-0009:acl0 - fpga-0009:acl1

// e.g.
// the "north" neighbor of fpga-0005:acl0 is fpga-0009:acl0 (wrap around)
// the "south" neighbor of fpga-0005:acl0 is fpga-0006:acl0
// the "west" neighbor of fpga-0005:acl0 is fpga-0005:acl1 (wrap around)
// the "east" neighbor of fpga-0005:acl0 is fpga-0005:acl1
The local view of connections as seen from within a local node is as follows.
// local view from FPGA "local" to neighbors "north", "south", "west", "east"
// ch0 connects to neighbor "north"
local:ch0 <-> north:ch1
// ch1 connects to neighbor "south"
local:ch1 <-> south:ch0
// ch2 connects to neighbor "west"
local:ch2 <-> west:ch3
// ch3 connects to neighbor "east"
local:ch3 <-> east:ch2
The complete set of connections is as follows
fpga-0005:acl0:ch1 <-> fpga-0006:acl0:ch0
fpga-0005:acl0:ch3 <-> fpga-0005:acl1:ch2
fpga-0005:acl1:ch1 <-> fpga-0006:acl1:ch0
fpga-0005:acl1:ch3 <-> fpga-0005:acl0:ch2
fpga-0006:acl0:ch1 <-> fpga-0007:acl0:ch0
fpga-0006:acl0:ch3 <-> fpga-0006:acl1:ch2
fpga-0006:acl1:ch1 <-> fpga-0007:acl1:ch0
fpga-0006:acl1:ch3 <-> fpga-0006:acl0:ch2
fpga-0007:acl0:ch1 <-> fpga-0008:acl0:ch0
fpga-0007:acl0:ch3 <-> fpga-0007:acl1:ch2
fpga-0007:acl1:ch1 <-> fpga-0008:acl1:ch0
fpga-0007:acl1:ch3 <-> fpga-0007:acl0:ch2
fpga-0008:acl0:ch1 <-> fpga-0009:acl0:ch0
fpga-0008:acl0:ch3 <-> fpga-0008:acl1:ch2
fpga-0008:acl1:ch1 <-> fpga-0009:acl1:ch0
fpga-0008:acl1:ch3 <-> fpga-0008:acl0:ch2
fpga-0009:acl0:ch1 <-> fpga-0005:acl0:ch0
fpga-0009:acl0:ch3 <-> fpga-0009:acl1:ch2
fpga-0009:acl1:ch1 <-> fpga-0005:acl1:ch0
fpga-0009:acl1:ch3 <-> fpga-0009:acl0:ch2
Troubleshooting
CL_INVALID_PROGRAM_EXECUTABLE with fast emulation
When using the fast emulator along with host code that was previously tested with the legacy emulator and/or hardware execution, you may encounter a problem during execution that corresponds to the OpenCL error code CL_INVALID_PROGRAM_EXECUTABLE. To fix this issue, your host code needs to invoke clBuildProgram (C API) or program.build() (C++ API). This invocation is required by any conforming OpenCL code, but with legacy emulation and hardware execution it was not required and could be skipped.
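A sketch of the missing call, assuming `program` was created from the .aocx binary with clCreateProgramWithBinary and `device` is the target device (the function name is illustrative):

#include <CL/cl.h>
#include <stdio.h>

/* The fast emulator requires an explicit build step after
 * clCreateProgramWithBinary; legacy emulation and hardware
 * execution tolerated skipping it. */
static int build_program(cl_program program, cl_device_id device) {
    cl_int status = clBuildProgram(program, 1, &device, "", NULL, NULL);
    if (status != CL_SUCCESS) {
        fprintf(stderr, "clBuildProgram failed with status %d\n", status);
        return -1;
    }
    return 0; /* with the C++ API, the equivalent is program.build() */
}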
FPGA programmed with bitstreams built with different SDK versions in the same session
Error message during bitstream programming from host code or with aocl program
FAILED to read auto-discovery string at byte 2: Expected version is 19, found 20
Error: The currently programmed/flashed design is no longer supported in this release.
Please recompile the design with the present version of the SDK and re-program/flash the board.
acl_hal_mmd.c:1460:assert failure: Failed to initialize kernel interface
main: acl_hal_mmd.c:1460: l_try_device: Assertion `0' failed.
This or similar error messages come up when invoking host code or aocl commands after the FPGA was configured with a bitstream built with an earlier SDK version. Workaround:
- Load the latest intelFPGA_pro module (e.g. 19.3.0)
- Configure the target bitstream (e.g. built with 19.2.0 SDK) using aocl program or your OpenCL host code
- Optionally, reload the intelFPGA_pro module that was used when building the bitstream
LOCALE settings forwarded from your computer
Error message in quartus_sh_compile.log
Internal Error: Sub-system: CFG_INI, File: /quartus/ccl/cfg_ini/cfg_ini_reader.cpp, Line: 1530
Couldn't parse ini setting qspc_nldm_max_step_size=10.0 as a floating point value
Stack Trace:
    0xb4fe: err_report_internal_error(char const*, char const*, char const*, int) + 0x1a (ccl_err)
    0x17b45: cfg_get_double_value(std::string const&, double) + 0xe4 (ccl_cfg_ini)
    0x8f788: CFG_INI_DOUBLE::refresh() + 0x48 (tsm_qspc)
...
Error (23035): Tcl error: couldn't open "top.fit.rpt": no such file or directory
    while executing
"open $report"
    (procedure "fetch_pseudo_panel" line 3)
    invoked from within
"fetch_pseudo_panel $report "Found \[0-9\]* clocks" {1 0} 2"
    (procedure "fetch_clock_periods" line 6)
    invoked from within
"fetch_clock_periods $report"
    (procedure "fetch_clock" line 2)
    invoked from within
"fetch_clock "$revision_name.fit.rpt" $clkname"
    (procedure "get_fmax_from_report" line 8)
    invoked from within
"get_fmax_from_report $k_clk_name 1 $recovery_multicycle $iteration"
    (procedure "get_kernel_clks_and_fmax" line 5)
    invoked from within
"get_kernel_clks_and_fmax $k_clk_name $k_clk2x_name $recovery_multicycle $iteration"
    (file "/cm/shared/opt/intelFPGA_pro/19.4.0/hld/ip/board/bsp/adjust_plls.tcl" line 815)
    invoked from within
"source "$sdk_root/ip/board/bsp/adjust_plls.tcl""
    (file "scripts/post_flow_pr.tcl" line 59)
Error (23031): Evaluation of Tcl script scripts/post_flow_pr.tcl unsuccessful
...
Error: Quartus Fitter has failed! Breaking execution...
Error (23035): Tcl error:
    while executing
"qexec "quartus_cdb -t scripts/post_flow_pr.tcl \"$top_path\"""
    invoked from within
"if {$revision_name eq "top"} {
    post_message "Compiling top revision..."
    # Load OpenCL BSP utility functions
    source "$sdk_root/ip/board/bsp/ope..."
    (file "compile_script.tcl" line 40)
Error (23031): Evaluation of Tcl script compile_script.tcl unsuccessful
Error: Quartus Prime Compiler Database Interface was unsuccessful. 3 errors, 0 warnings
Error: Peak virtual memory: 1021 megabytes
Error: Processing ended: Mon Mar 30 14:47:15 2020
Error: Elapsed time: 03:06:43
Error: System process ID: 21428
The root cause is outlined in the first message of the above excerpt from quartus_sh_compile.log: parsing a number as a floating point value failed. This can be caused by locale settings that are forwarded from the computer you connect to Noctua from. After connecting to Noctua, check your locale settings with locale, and if necessary change them with export LC_NUMERIC="en_US.UTF-8":
[tester@fe-1 matrix_mult]$ locale
...
LC_NUMERIC="de_DE.UTF-8"      // can cause above error
...
[tester@fe-1 matrix_mult]$ export LC_NUMERIC="en_US.UTF-8"
[tester@fe-1 matrix_mult]$ locale
...
LC_NUMERIC="en_US.UTF-8"      // known to work
...