Noctua FPGA Usage, Integration and Development


Updates

2021

  • 7 Jan
    • Installed new Intel FPGA SDK for OpenCL 20.4.0
    • Added reference to Intel design examples

2020

  • 4 Dec
    • Updated BSP name from legacy nalla_pcie to bittware_520n. On Noctua, now both modules are available to keep existing scripts working.
    • Small updates to documentation
    • Small updates to BSP module files
  • 29 Oct
    • Installed new Intel FPGA SDK for OpenCL 20.3.0
    • Small updates to module files to reflect the changed structure of binaries inside the SDK
  • 26 Jul
    • Installed new Intel FPGA SDK for OpenCL 20.2.0
  • 16. Apr
    • Installed new Intel FPGA SDK for OpenCL 20.1.0
    • New documentation for the compatibility matrix of BSP and SDK versions.
  • 06. Mar
    • Extended description on host code linkage and platform selection for hardware execution and emulation.
  • 08. Jan
    • Fast emulation for Intel FPGA SDK for OpenCL 19.3.0 and newer is now also supported on frontends, compute and synthesis nodes.

2019

  • 18. Dec
    • Installed new Intel FPGA SDK for OpenCL 19.4.0
    • Provided script to access custom topology information also when using salloc.
    • Created overview section and updated some details of the documentation.
  • 16. Dec
    • Start of update log for FPGA infrastructure and documentation.
    • Updated structure of module files for Bittware BSPs and Intel FPGA SDKs: export an environment variable FPGA_BOARD_NAME that you can use in your build system.
    • Updated documentation of emulation: fast emulation is now supported on fpga-nodes starting with the 19.3.0 Intel FPGA SDK.

Overview

The Noctua FPGA partition is equipped with Bittware 520N cards with Intel Stratix 10 GX 2800 FPGAs. To execute or develop FPGA designs for this partition, two components are required, both of which are available in different versions:

  • A Bittware board support package (BSP) including an FPGA base bitstream and drivers.
  • The Intel FPGA SDK for OpenCL with further documentation by Intel. Relevant documentation includes in particular
    • Intel FPGA SDK for OpenCL Programming Guide
    • Intel FPGA SDK for OpenCL Best Practices Guide

For hardware execution, fpga-nodes with the fitting BSP version need to be allocated; see Section FPGA System Access. The software environment is set up using modules; see Section Setting Up the FPGA Software Environment. During the early operations phase, Bittware BSP versions and Intel FPGA SDK versions were deployed in lockstep, but now BSPs can also be used with newer SDK versions.

Applications with Integrated FPGA Acceleration

  • CP2K for DFT with FPGA Acceleration of the Submatrix Method: ready-to-use module files and bitstreams deployed on Noctua
  • CP2K for DFT with FPGA Acceleration of 3D FFTs: FPGA support in the CP2K main repository, plus an extra repository with FPGA designs fitting the Bittware 520N cards in Noctua
  • HPCC FPGA (HPC Challenge Benchmark Suite for FPGAs): repository with a benchmark suite targeting FPGAs from Intel (including the Bittware 520N with Stratix 10 cards in Noctua) and Xilinx
  • Cannon Matrix Multiplication on FPGAs: repository with an implementation of Cannon matrix multiplication as a building block for GEMM on FPGAs, fitting the Bittware 520N cards in Noctua

Intel Design Examples

The latest Intel examples are shipped with the Intel FPGA SDK for OpenCL. On Noctua:

[user@fpga-0017 examples_aoc]$ module load intelFPGA_pro
[user@fpga-0017 examples_aoc]$ cd $INTELFPGAOCLSDKROOT/examples_aoc
[user@fpga-0017 examples_aoc]$ ls
asian_option  compression       extlibs  fft1d_offchip  jpeg_decoder      library_hls_sot      loopback_hostpipe  matrix_mult                   optical_flow  vector_add
channelizer   compute_score     fd3d     fft2d          library_example1  library_matrix_mult  Makefile           multithread_vector_operation  sobel_filter  video_downscaling
common        double_buffering  fft1d    hello_world    library_example2  local_memory_cache   mandelbrot         n_way_buffering               tdfir         web

You can copy any interesting example to your working directory.
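
For example, the vector_add example can be copied and checked in emulation like this (a minimal sketch; it assumes the usual layout of the Intel examples with the kernel in device/vector_add.cl, which may differ per SDK version):

module load intelFPGA_pro
cp -r $INTELFPGAOCLSDKROOT/examples_aoc/vector_add ~/vector_add
cd ~/vector_add
# compile the kernel for emulation; see Section Emulation of FPGA OpenCL kernels
aoc -march=emulator device/vector_add.cl -o bin/vector_add.aocx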

FPGA System Access and BSP Selection

To use Noctua nodes with FPGAs, along with your Slurm command you need to select the partition and provide a constraint that specifies the version of the Bittware board support package (BSP) your designs have been built for, e.g.

srun --partition=fpga --constraint=19.4.0_max

Constraints can be used together with srun, sbatch and salloc; however, under some conditions salloc will fail, see the end of this section for details. We recommend always using srun or sbatch.
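
For batch jobs, a minimal job script could look like the following sketch (the host binary and .aocx file names are placeholders; the BSP module version naming follows Section Setting Up the FPGA Software Environment):

#!/bin/bash
#SBATCH --partition=fpga
#SBATCH --constraint=19.4.0_max
#SBATCH --nodes=1
#SBATCH --time=00:30:00

# the BSP module version must match the allocated constraint
module load bittware_520n/19.4.0_max intelFPGA_pro
./host ./kernel.aocx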

The supported board support package (BSP) versions are:

19.4.0_max
19.4.0_hpc
19.2.0_max
19.2.0_hpc
19.1.0_max
18.1.1
18.1.1_hpc
18.0.1
18.0.0

A list of matching versions of the Intel(R) FPGA SDK for OpenCL(TM) can be found in Section Matching SDK versions and BSP versions below.

For hardware execution on the FPGA nodes, you must always specify the job constraint that fits the BSP version used to synthesize the bitstream (.aocx file). For details refer to the next subsection. If you see a message like the one below, you have used the wrong constraint, which can impact your own results or make the machine unavailable for subsequent jobs.

MMD INFO : [aclbitt_s10_pcie1] Quartus versions for base and import compile do not match
MMD INFO : [aclbitt_s10_pcie1] Board is currently programmed with sof from Quartus 19.4.0 64
MMD INFO : [aclbitt_s10_pcie1] PR import was compiled with Quartus 19.2.0 57

When designing and creating new FPGA bitstreams, we recommend always using the latest version. Along with stability improvements, the 19.2.0 BSPs also allow for up to ~100 MHz higher clock frequencies. BSP versions prior to 19.2.0 may become deprecated in the future. Use older BSP versions only to reuse existing bitstreams that were synthesized earlier.

For the 19.4.0, 19.2.0 and 18.1.1 tools, there are two versions of the BSP that target the same board. The _max BSPs (e.g. 19.2.0_max) enable the external serial channels and thus offer the default functionality for our setup. The _hpc BSPs do not offer external serial channels, but may enable higher clock frequencies for the respective tool version.

Identifying BSP versions of existing bitstreams

You can find out the constraint required for an existing bitstream using two commands of the aocl binedit tool, e.g.

module load intelFPGA_pro
aocl binedit build/19_3/krn_auto/volume_dummy_v10.aocx print .acl.board 
aocl binedit build/19_3/krn_auto/volume_dummy_v10.aocx print .acl.board_package

Matching the output values of aocl binedit against the first two columns of the following table allows you to identify the constraint to be used.

.acl.board       .acl.board_package                                                            --constraint
p520_max_sg280l  /cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10               19.4.0_max
p520_max_sg280l  /cm/shared/opt/intelFPGA_pro/19.2.0/hld/board/bittware_pcie/s10               19.2.0_max
p520_max_sg280l  /cm/shared/opt/intelFPGA_pro/19.1/hld/board/bittware_pcie/s10                 19.1.0
p520_max_sg280l  /opt/intelFPGA_pro/18.1.1/hld/board/nalla_pcie                                18.1.1_max
p520_max_sg280l  /opt/intelFPGA_pro/18.0.1/hld/board/nalla_pcie                                18.0.1
p520_max_sg280l  /opt/intelFPGA_pro/18.0.0/hld/board/nalla_pcie                                18.0.0
p520_hpc_sg280l  /cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10               19.4.0_hpc
p520_hpc_sg280l  /cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10_hpc_default   19.4.0_hpc
p520_hpc_sg280l  /cm/shared/opt/intelFPGA_pro/19.2.0/hld/board/bittware_pcie/s10               19.2.0_hpc
p520_hpc_sg280l  /cm/shared/opt/intelFPGA_pro/19.2.0/hld/board/bittware_pcie/s10_hpc_default   19.2.0_hpc
p520_hpc_sg280l  /opt/intelFPGA_pro/18.1.1/hld/board/nalla_pcie                                18.1.1_hpc
p520_max_sg280h  unsupported
p520_hpc_sg280h  unsupported

Unsupported BSPs: Since the 19.1.0 tools, the BSP comes with support for two different target boards: p520_max_sg280l and p520_max_sg280h. These differentiate between boards with so-called L-Tile and H-Tile FPGAs. Our boards contain L-Tile FPGAs, so only use p520_max_sg280l (and p520_hpc_sg280l for selected BSP versions) as targets for synthesis.

Issues with salloc

Constraints can be specified together with srun, sbatch and salloc. However, salloc only works reliably when the constraints are already satisfied by available nodes. We recommend always using srun or sbatch.

A problem occurs when one of the nodes to be allocated is configured for a different constraint and is currently in use. In this case, salloc fails with the following error message.

salloc: error: Job submit/allocate failed: Requested node configuration is not available

Workaround: use an allocation without requesting specific node names.
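
For example (a sketch with illustrative node names):

# may fail if one of the named nodes is configured for a different constraint and busy:
salloc -p fpga --constraint=19.4.0_max --nodelist=fpga-0001,fpga-0002
# workaround: request only a node count instead of specific node names:
salloc -p fpga --constraint=19.4.0_max -N 2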

A problem also occurs when one of the nodes to be allocated is configured for a different constraint and is currently free. The allocation succeeds while the nodes are still being reconfigured; programs or scripts that start during this time will fail, with varying error messages.

Setting Up the FPGA Software Environment

Designing, emulating and executing FPGA designs requires a common software environment. The Bittware (formerly Nallatech) BSP (including drivers) and the Intel(R) FPGA SDK for OpenCL(TM) are set up using modules with the respective commands:

module load bittware_520n
module load intelFPGA_pro 

For any actual FPGA usage, the version of the BSP module (bittware_520n) must match the BSP version of the allocated nodes (selected via the constraint). By default, the latest supported version is loaded. To use a specific version, append the version number like this:

module load bittware_520n/19.4.0_hpc intelFPGA_pro/20.3.0

Matching SDK versions and BSP versions

Bittware BSP modules (bittware_520n) can always be used with Intel(R) FPGA SDK for OpenCL(TM) modules (intelFPGA_pro) of the same or a newer version number, up to one major version update. For hardware execution, the bittware_520n BSP must always match the allocated constraint. The following combinations of bittware_520n BSPs and intelFPGA_pro SDKs are supported:

Compatibility matrix of BSP (bittware_520n) and SDK (intelFPGA_pro) versions. Rows: intelFPGA_pro modules; columns: bittware_520n modules (make sure to match the allocated constraint); "yes" marks a supported combination.

intelFPGA_pro      19.4.0_max/_hpc   19.2.0_max/_hpc   19.1.0   18.1.1_max/_hpc   18.0.1   18.0.0
20.4.0             yes, recommended  yes               yes      -                 -        -
20.3.0             yes               yes               yes      -                 -        -
20.2.0             yes               yes               yes      -                 -        -
20.1.0             yes               yes               yes      -                 -        -
19.4.0(_max/_hpc)  yes               yes               yes      yes               yes      yes
19.3.0             -                 yes               yes      yes               yes      yes
19.2.0(_max/_hpc)  -                 yes               yes      yes               yes      yes
19.1.0             -                 -                 yes      yes               yes      yes
18.1.1(_max/_hpc)  -                 -                 -        yes               yes      yes
18.0.1             -                 -                 -        -                 yes      yes
18.0.0             -                 -                 -        -                 -        yes
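
For example, a 19.2.0 BSP with the newer 20.4.0 SDK is a combination permitted by the matrix (for hardware execution, this requires nodes allocated with --constraint=19.2.0_max):

module load bittware_520n/19.2.0_max intelFPGA_pro/20.4.0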

When designing and creating new FPGA bitstreams, we recommend always using the latest version. Along with stability improvements, the 19.4.0 and 19.2.0 BSPs also allow for up to ~100 MHz higher clock frequencies than previous versions. BSP versions prior to 19.4.0 may become deprecated in the future. Use older BSP versions only to reuse existing bitstreams that were synthesized earlier.

Required gcc versions

When you link against the Intel FPGA OpenCL runtime, as well as for building most current examples, you have to use a newer compiler than the one loaded by default (gcc 4.8). The following modules have been tested successfully:

module load compiler/GCCcore/8.2.0
module load compiler/GCCcore/7.3.0
module load compiler/GCCcore/6.4.0
module load compiler/GCCcore/5.4.0
module load gcc/7.2.0
module load gcc/6.1.0

Since the 19.3 release of the Intel FPGA SDK, Intel documents the underlying requirement for a suitable libstdc++ in the LD_LIBRARY_PATH; loading one of the modules above fulfills this requirement.
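
As a minimal sketch, a host build with a tested compiler module can look like this (host.cpp is a placeholder for your host source; aocl compile-config and aocl link-config expand to the required include and link flags):

module load compiler/GCCcore/8.2.0
module load intelFPGA_pro
# compile and link the host program against the Intel FPGA OpenCL runtime
g++ host.cpp -o host $(aocl compile-config) $(aocl link-config)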

Tool Info and Board Name

You can confirm the tool version in use and find out the name of the target platform that you need for building hardware designs like this:

[tester@fe-1 ~]$ aoc -version
Intel(R) FPGA SDK for OpenCL(TM), 64-Bit Offline Compiler
Version 19.1.0 Build 240 Pro Edition
Copyright (C) 2019 Intel Corporation

[tester@fe-1 ~]$ aoc -list-boards
Board list:
  p520_max_sg280h
     Board Package: /cm/shared/opt/intelFPGA_pro/19.1/hld/board/bittware_pcie/s10
     Channels:      kernel_input_ch0, kernel_output_ch0, kernel_input_ch1, kernel_output_ch1, kernel_input_ch2, kernel_output_ch2, kernel_input_ch3, kernel_output_ch3

  p520_max_sg280l
     Board Package: /cm/shared/opt/intelFPGA_pro/19.1/hld/board/bittware_pcie/s10
     Channels:      kernel_input_ch0, kernel_output_ch0, kernel_input_ch1, kernel_output_ch1, kernel_input_ch2, kernel_output_ch2, kernel_input_ch3, kernel_output_ch3

or

[tester@fe-1 matrix_mult]$ aoc -version
Intel(R) FPGA SDK for OpenCL(TM), 64-Bit Offline Compiler
Version 18.1.1 Build 263 Pro Edition
Copyright (C) 2018 Intel Corporation

[tester@fe-1 matrix_mult]$ aoc -list-boards
Board list:
  p520_hpc_sg280l
     Board Package: /opt/intelFPGA_pro/18.1.1/hld/board/nalla_pcie

  p520_max_sg280l
     Board Package: /opt/intelFPGA_pro/18.1.1/hld/board/nalla_pcie
     Channels:      kernel_input_ch0, kernel_output_ch0, kernel_input_ch1, kernel_output_ch1, kernel_input_ch2, kernel_output_ch2, kernel_input_ch3, kernel_output_ch3

The tools are available on all nodes and frontends.

Sanity check: Actual Hardware on Noctua FPGA Node

You can perform a status and sanity check on the actual FPGA hardware like this. This test works only on the fpga-nodes (fpga-0001 to fpga-0016), not on the frontends or compute nodes.

[tester@fpga-0006 ~]$ aocl diagnose
--------------------------------------------------------------------
ICD System Diagnostics                                              
--------------------------------------------------------------------

Using the following location for ICD installation: 
	/etc/OpenCL/vendors

Found 4 icd entry at that location:
	/etc/OpenCL/vendors/intel-cpu.icd
	/etc/OpenCL/vendors/intel-neo.icd
	/etc/OpenCL/vendors/Intel_FPGA_SSG_Emulator.icd
	/etc/OpenCL/vendors/Altera.icd

The following OpenCL libraries are referenced in the icd files:
	libintelocl.so
	libigdrcl.so
	libintelocl_emu.so
	libalteracl.so

Checking LD_LIBRARY_PATH for registered libraries:
	libalteracl.so was registered on the system at /cm/shared/opt/intelFPGA_pro/20.3.0/hld/host/linux64/lib

Using the following location for fcd installations:
	/opt/Intel/OpenCL/Boards

Found 1 fcd entry at that location:
	/opt/Intel/OpenCL/Boards/bitt_s10_pcie.fcd

The following OpenCL libraries are referenced in the fcd files:
	/cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10/linux64/lib/libbitt_s10_pcie_mmd.so

Checking LD_LIBRARY_PATH for registered libraries:
	/cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10/linux64/lib/libbitt_s10_pcie_mmd.so was registered on the system.

Number of Platforms = 2 
	1. Intel(R) FPGA Emulation Platform for OpenCL(TM)              | Intel(R) Corporation           | OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
	2. Intel(R) FPGA SDK for OpenCL(TM)                             | Intel(R) Corporation           | OpenCL 1.0 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
--------------------------------------------------------------------
ICD diagnostics PASSED                                              
--------------------------------------------------------------------
--------------------------------------------------------------------
BSP Diagnostics                                                     
--------------------------------------------------------------------
--------------------------------------------------------------------
Device Name:
acl0

BSP Install Location:
/cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10

Vendor: BittWare ltd
MMD INFO : QSFP Module 0 : serial number : EO1190226 power class : 3
MMD INFO : QSFP Module 0 : Setting to high power mode
MMD INFO : QSFP Module 0 : qsfpid 0x0d : power class : 3 (power class register is 0x02) 
MMD INFO : QSFP Module 0 : temperature : 48.32 degC
MMD INFO : QSFP Module 1 : serial number : EO1190226 power class : 3
MMD INFO : QSFP Module 1 : Setting to high power mode
MMD INFO : QSFP Module 1 : qsfpid 0x0d : power class : 3 (power class register is 0x02) 
MMD INFO : QSFP Module 1 : temperature : 26.15 degC
MMD INFO : QSFP Module 2 : serial number : EO1190226 power class : 3
MMD INFO : QSFP Module 2 : Setting to high power mode
MMD INFO : QSFP Module 2 : qsfpid 0x0d : power class : 3 (power class register is 0x02) 
MMD INFO : QSFP Module 2 : temperature : 38.14 degC
MMD INFO : QSFP Module 3 : serial number : EO1190226 power class : 3
MMD INFO : QSFP Module 3 : Setting to high power mode
MMD INFO : QSFP Module 3 : qsfpid 0x0d : power class : 3 (power class register is 0x02) 
MMD INFO : QSFP Module 3 : temperature : 38.37 degC

Phys Dev Name      Status   Information

aclbitt_s10_pcie0  Passed   BittWare Stratix 10 OpenCL platform (aclbitt_s10_pcie0)
                            PCIe dev_id = 5170, sub_dev_id = 5204, bus:slot.func = 16:00.00, Gen3 x8
                            Card serial number = EO1190226.
                            FPGA temperature = 40 degrees C.
                            Total Card Power Usage = 67.1114 Watts.
                            Serial Channel 0 status reg = 0x0ff11ff1.
                            Serial Channel 1 status reg = 0x0ff11ff1.
                            Serial Channel 2 status reg = 0x0ff11ff1.
                            Serial Channel 3 status reg = 0x0ff11ff1.
                            Current PR ID reg = 0x3ccfa79a(1020241818).

DIAGNOSTIC_PASSED
--------------------------------------------------------------------
--------------------------------------------------------------------
Device Name:
acl1

BSP Install Location:
/cm/shared/opt/intelFPGA_pro/19.4.0/hld/board/bittware_pcie/s10

Vendor: BittWare ltd
MMD INFO : QSFP Module 0 : serial number : EO1190226 power class : 3
MMD INFO : QSFP Module 0 : Setting to high power mode
MMD INFO : QSFP Module 0 : qsfpid 0x0d : power class : 3 (power class register is 0x02) 
MMD INFO : QSFP Module 0 : temperature : 38.60 degC
MMD INFO : QSFP Module 1 : serial number : EO1190226 power class : 3
MMD INFO : QSFP Module 1 : Setting to high power mode
MMD INFO : QSFP Module 1 : qsfpid 0x0d : power class : 3 (power class register is 0x02) 
MMD INFO : QSFP Module 1 : temperature : 41.93 degC
MMD INFO : QSFP Module 2 : serial number : EO1190226 power class : 3
MMD INFO : QSFP Module 2 : Setting to high power mode
MMD INFO : QSFP Module 2 : qsfpid 0x0d : power class : 3 (power class register is 0x02) 
MMD INFO : QSFP Module 2 : temperature : 40.40 degC
MMD INFO : QSFP Module 3 : serial number : EO1190226 power class : 3
MMD INFO : QSFP Module 3 : Setting to high power mode
MMD INFO : QSFP Module 3 : qsfpid 0x0d : power class : 3 (power class register is 0x02) 
MMD INFO : QSFP Module 3 : temperature : 37.69 degC

Phys Dev Name      Status   Information

aclbitt_s10_pcie1  Passed   BittWare Stratix 10 OpenCL platform (aclbitt_s10_pcie1)
                            PCIe dev_id = 5170, sub_dev_id = 5204, bus:slot.func = af:00.00, Gen3 x8
                            Card serial number = EO1190226.
                            FPGA temperature = 38 degrees C.
                            Total Card Power Usage = 66.5158 Watts.
                            Serial Channel 0 status reg = 0x0ff11ff1.
                            Serial Channel 1 status reg = 0x0ff11ff1.
                            Serial Channel 2 status reg = 0x0ff11ff1.
                            Serial Channel 3 status reg = 0x0ff11ff1.
                            Current PR ID reg = 0x3ccfa79a(1020241818).

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices

Emulation of FPGA OpenCL kernels

Before synthesis and hardware execution, it is highly recommended to check the functionality of your OpenCL design in emulation. Between versions 19.1.0 and 20.2.0 of the Intel FPGA SDK for OpenCL, two emulation modes are implemented, with compiler flags to aoc controlling the mode (in all modes, -march=emulator is used to target emulation in the first place). For our Noctua setup, we recommend fast emulation for Intel FPGA SDK for OpenCL 19.3.0 or newer. 19.2.0 is the only version where you need an explicit compiler argument, -legacy-emulator, to get the recommended mode.

Intel FPGA SDK for OpenCL version   Legacy emulation      Fast emulation   Recommended for Noctua
18.x.x                              default               not available    Legacy emulation
19.1.0                              default               -fast-emulator   Legacy emulation
19.2.0                              -legacy-emulator      default          Legacy emulation
19.3.0                              -legacy-emulator      default          Fast emulation
19.4.0                              -legacy-emulator      default          Fast emulation
20.1.0                              -legacy-emulator      default          Fast emulation
20.2.0                              -legacy-emulator      default          Fast emulation
20.3.0 and later                    no longer supported   default          Fast emulation
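
As a sketch, the recommended fast emulation workflow with a 19.3.0-or-newer SDK looks like this (kernel and host file names are placeholders):

module load intelFPGA_pro/20.3.0
# fast emulation is the default mode for -march=emulator since 19.3.0
aoc -march=emulator kernel.cl -o kernel_emu.aocx
# the host must select the "Intel(R) FPGA Emulation Platform for OpenCL(TM)" platform
./host kernel_emu.aocx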

For setup of the host code for emulation, refer to the following section.

Host code setup for emulation and hardware execution

Linking your host code

For hardware execution, all link configurations obtained with aocl link-config are functional. For fast emulation, there are currently problems with the automatic link configuration on frontend and compute nodes.

Workaround 1

Build and test your host code on an fpga-node. You can use --constraint=emul if you don't need a specific BSP version for hardware tests.

Workaround 2

Manually set the right linkage options in your build process.

The configuration you want to get looks like this for the respective tool versions:

[user@fpga-0005 ~]$ aocl link-config
-L/cm/shared/opt/intelFPGA_pro/19.3.0/hld/host/linux64/lib -lOpenCL
[user@fpga-0005 ~]$ aocl link-config
-L/cm/shared/opt/intelFPGA_pro/19.4.0/hld/host/linux64/lib -lOpenCL

If you instead see something like the following, you need to manually adjust your build flow:

[user@ln-0002 ~]$ aocl link-config
-L/cm/shared/opt/intelFPGA_pro/19.2.0/hld/board/bittware_pcie/s10/linux64/lib -L/cm/shared/opt/intelFPGA_pro/19.3.0/hld/host/linux64/lib -Wl,--no-as-needed -lalteracl -lbitt_s10_pcie_mmd -lelf
[user@ln-0002 ~]$ aocl link-config
-L/cm/shared/opt/intelFPGA_pro/19.2.0/hld/board/bittware_pcie/s10/linux64/lib -L/cm/shared/opt/intelFPGA_pro/19.4.0/hld/host/linux64/lib -Wl,--no-as-needed -lalteracl -lbitt_s10_pcie_mmd -lelf

You can check the correct linkage of your host code with ldd, e.g. ldd bin/host, and compare with the outputs in the next section.
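
As a sketch, a manual host link line for fast emulation can look like this (host.cpp is a placeholder; the paths follow the installations shown above):

# link directly against libOpenCL from the SDK host library directory
g++ host.cpp -o bin/host \
    -I$INTELFPGAOCLSDKROOT/host/include \
    -L$INTELFPGAOCLSDKROOT/host/linux64/lib -lOpenCL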

Selecting the right platform name in the host code

Many OpenCL host codes either select the first OpenCL platform they find, or use a fixed identifier string to select their target platform. When using fast emulation, you need to adapt the platform name in the host code accordingly. Note that many host code examples convert the platform names to lower or upper case before matching.

Execution mode      Platform name to match in host code              Required linker flag     Linking sanity check with ldd
Hardware execution  Intel(R) FPGA SDK for OpenCL(TM)                 -lOpenCL or -lalteracl   any of the below versions
Fast emulation      Intel(R) FPGA Emulation Platform for OpenCL(TM)  -lOpenCL                 libOpenCL.so.1 => /cm/shared/opt/intelFPGA_pro/<QUARTUS_VERSION>/hld/host/linux64/lib/libOpenCL.so.1
Legacy emulation    Intel(R) FPGA SDK for OpenCL(TM)                 -lalteracl               libalteracl.so => /cm/shared/opt/intelFPGA_pro/<QUARTUS_VERSION>/hld/host/linux64/lib/libalteracl.so

Hardware Builds (full synthesis) of FPGA OpenCL kernels

A full hardware build of an FPGA OpenCL kernel takes at least several hours. You can use the fpgasyn, long or batch partitions to submit hardware build jobs. Only the fpgasyn partition allows sharing a node between multiple synthesis jobs, depending on your estimated main memory usage during synthesis. Below we provide rough guidelines for the hardware build times to expect. With regard to time limits, you typically want to provide generous headroom (e.g. 4x, up to the full 3 day limit of fpgasyn) in your job submission, in order to avoid automatic cancellation of a job that could still have finished. For the memory limit, it can be more beneficial to provide an estimate closer to the real utilization, in order to leave space for other jobs on the node.

Rough estimates for hardware build times and memory limits for different design types:

  • Low resource utilization (<10% in Kernel System), simple memory interface (global interconnect for <10 global loads + stores), loops with low to medium latency (<500 cycles):
    estimated time 2-4 h, estimated memory 45 GB; suggested arguments for the fpgasyn partition: --mem=45000MB --time=8:00:00
  • Medium resource utilization (<40% ALUTs and FFs, and <60% RAMs and DSPs in Kernel System), simple to medium memory interface (global interconnect for <20 global loads + stores), loops with low to medium latency (<500 cycles):
    estimated time 8-12 h, estimated memory 60-90 GB; suggested arguments: --mem=90000MB --time=24:00:00
  • High resource utilization (>50% ALUTs and FFs, or >70% RAMs and DSPs in Kernel System), simple to medium memory interface (global interconnect for <20 global loads + stores), loops with low to medium latency (<500 cycles):
    estimated time 12-20 h, estimated memory 90-120 GB; suggested arguments: --mem=120000MB --time=48:00:00
  • Any resource utilization, memory interface with global interconnect for >100 global loads + stores, or loops with high to very high latency (>2000 cycles):
    estimated time 30-60 h, estimated memory 120+ GB; suggested arguments: --exclusive --time=72:00:00

Moreover, the synthesis runs generate several GB of result files, so it is advisable to run kernel synthesis in a directory on the $PC2PFS parallel file system, for both performance and capacity reasons.
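
A synthesis job script following these guidelines can look like the following sketch (project directory and kernel file are placeholders; FPGA_BOARD_NAME is exported by the BSP module files, see the update log):

#!/bin/bash
#SBATCH --partition=fpgasyn
#SBATCH --mem=90000MB
#SBATCH --time=24:00:00

module load bittware_520n intelFPGA_pro
# run the synthesis on the parallel file system
cd $PC2PFS/YOUR_PROJECT/synthesis
aoc -board=$FPGA_BOARD_NAME kernel.cl -o kernel.aocx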

For some tool versions (Intel FPGA SDK for OpenCL and/or Bittware BSP) there are specific optimization recommendations:

All current Bittware BSPs, starting with 18.1.1 (18.1.1_max and 18.1.1_hpc)

For most designs, it is recommended to use the additional kernel compilation options -global-ring -duplicate-ring. These change the external memory interconnect between kernels and the base region to a structure that is optimized for Stratix 10 silicon. This has been seen to improve kernel fmax for BSPs with several banks of DDR such as the 520. Since the 19.1 Intel FPGA SDK for OpenCL software release, Intel has documented these settings in the AOCL Programming Guide and AOCL Best Practices Guide.

For example, to compile the vector addition example:

aoc -board=p520_max_sg280l vector_add.cl -global-ring -duplicate-ring

18.1.1 Intel FPGA SDK for OpenCL Known Issue

Under some circumstances, the 18.1.1 SDK can generate designs that contain severe timing violations. Do not use the bitstreams generated in this case.

Other tool versions only generate bitstreams with minor timing violations, which are generally okay to execute in hardware. For severe timing violations, they do not generate a bitstream, but report routing failures instead.

Warnings indicating this issue in the aoc console output look as follows.

Warning: hold time violation of -0.232 ns on clock:

  board_inst|kernel_clk_gen|kernel_clk_gen|kernel_pll_outclk0   

Corresponding outputs should show up in quartus_sh_compile.log and top.failing_clocks.rpt as

Info: Clock domains failing timing
+--------+---------------+-----------------------------------------------------------------+------------------------+-----------------+
; Slack  ; End Point TNS ; Clock                                                           ; Operating conditions   ; Timing analysis ;
+--------+---------------+-----------------------------------------------------------------+------------------------+-----------------+
; -0.429 ; -0.835        ; board_inst|mem_bank1|mem_emif_bank1|mem_emif_bank1_core_usr_clk ; Slow 900mV 100C Model  ; Setup           ;
; -0.347 ; -144.745      ; board_inst|kernel_clk_gen|kernel_clk_gen|kernel_pll_outclk0     ; Slow 900mV 0C Model    ; Hold            ;
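
To inspect a finished build for such violations, you can search the generated reports, e.g. like this (assuming the default aoc project directory named after the .cl file, here kernel):

grep -i "violation" kernel/quartus_sh_compile.log
cat kernel/top.failing_clocks.rpt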

18.0.1 Intel FPGA SDK for OpenCL

All compilations should use non-interleaving for a significant improvement in the kernel clock frequency. This is a limitation of the Intel 18.0.1 OpenCL compiler that is due to be resolved in later compiler releases.

For example, to compile the vector addition example:

aoc -board=p520_max_sg280l -no-interleaving=default vector_add.cl

Local SSDs

The FPGA nodes contain a local SSD mounted as /tmp that can be used as an additional scratch directory for performance-critical local IO, e.g. for FPGA logging data.

Serial Channel Point to Point Connections between Boards

When configured with a p520_max_sg280l BSP, all FPGA boards offer 4 point-to-point connections to other FPGA boards. From the OpenCL environment, these are used as external serial channels. A status reg value of 0xfff11ff1 in the diagnose output indicates an active connection. The topology of these connections is fully configurable with each job allocation. There is a number of predefined topologies that can be selected with a shorthand notation like --fpgalink="pair", or you can provide a series of individual connection descriptions like --fpgalink="n00:acl0:ch0-n01:acl0:ch0".

Custom topologies

The notation nXX:aclY:chZ describes a unique serial channel endpoint within a job allocation according to the following pattern:

  • nXX, e.g. n02, specifies the node ID within your allocation, starting with n00 for the first node; n02 thus specifies the third node of your allocation. You cannot use higher node IDs than the number of nodes requested by the allocation. At allocation time, the node ID is translated to a concrete node name, e.g. fpga-0008.
  • aclY, i.e. acl0 and acl1, describes the first and second FPGA board within each node.
  • chZ, i.e. ch0, ch1, ch2 and ch3, describes the 4 external channel connections of each board.

By specifying one unique pair of serial channel endpoints per --fpgalink argument, an arbitrary topology can be created within a job allocation. When the task starts, the topology is summarized and one environment variable is exported per fpgalink, e.g.

srun -A pc2-mitarbeiter --constraint=19.2.0_max -N 1 --fpgalink="n00:acl0:ch0-n00:acl1:ch0" --fpgalink="n00:acl0:ch1-n00:acl1:ch1" --fpgalink="n00:acl0:ch2-n00:acl1:ch2" --fpgalink="n00:acl0:ch3-n00:acl1:ch3" -p fpga --pty bash
...
Summarizing most recent topology information and exporting FPGALINK variables:
Host list
fpga-0004
Generated connections
FPGALINK0=fpga-0004:acl0:ch0-fpga-0004:acl1:ch0
FPGALINK1=fpga-0004:acl0:ch1-fpga-0004:acl1:ch1
FPGALINK2=fpga-0004:acl0:ch2-fpga-0004:acl1:ch2
FPGALINK3=fpga-0004:acl0:ch3-fpga-0004:acl1:ch3

We recommend using srun and sbatch, because this information is not automatically shown when using salloc (the configuration itself still works). When using salloc, you can still recover the information and set up your environment variables by invoking

source /opt/cray/slurm/default/etc/scripts/SAllocTopologyInfo.sh

Also when using srun or sbatch within your allocation, you can recover this information and even update the topology (careful: topologies are not restored to their previous state within salloc).
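
Within the job, the exported variables can be consumed like any other environment variable, e.g.:

# list all generated links of the current allocation
env | grep "^FPGALINK"
# or access an individual link description
echo $FPGALINK0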

Predefined topologies

As defining each connection manually can be tedious and error-prone, we also provide a set of predefined topologies. The following table summarizes the available options.

Topology type   Invocation            Min-max number of nodes   Brief description
pair            --fpgalink="pair"     1-N                       Pairwise connection of the 2 FPGAs within each node
clique          --fpgalink="clique"   2                         All-to-all connection for 2 nodes, 4 FPGAs
ring            --fpgalink="ringO"    1-N                       Ring with two links per direction, acl0 down, acl1 up
ring            --fpgalink="ringN"    1-N                       Ring with two links per direction, acl0 down, acl1 down
ring            --fpgalink="ringZ"    1-N                       Ring with two links per direction, acl0 and acl1 neighbors
torus           --fpgalink="torus2"   1-N                       Torus with 2 FPGAs per row
torus           --fpgalink="torus3"   2-N                       Torus with 3 FPGAs per row
torus           --fpgalink="torus4"   2-N                       Torus with 4 FPGAs per row
torus           --fpgalink="torus5"   3-N                       Torus with 5 FPGAs per row
torus           --fpgalink="torus6"   3-N                       Torus with 6 FPGAs per row

Pair topology

Within each node, each channel of one FPGA board is connected to the respective channel of the other FPGA board. No connections between nodes are made.

srun -p fpga -A pc2-mitarbeiter --constraint=19.2.0_max -N 3 --fpgalink=pair --pty bash
...
Summarizing most recent topology information and exporting FPGALINK variables:
Host list
fpga-0001
fpga-0002
fpga-0003
Pair topology
Generated connections
FPGALINK0=fpga-0001:acl0:ch0-fpga-0001:acl1:ch0
FPGALINK1=fpga-0001:acl0:ch1-fpga-0001:acl1:ch1
FPGALINK2=fpga-0001:acl0:ch2-fpga-0001:acl1:ch2
FPGALINK3=fpga-0001:acl0:ch3-fpga-0001:acl1:ch3
FPGALINK4=fpga-0002:acl0:ch0-fpga-0002:acl1:ch0
FPGALINK5=fpga-0002:acl0:ch1-fpga-0002:acl1:ch1
FPGALINK6=fpga-0002:acl0:ch2-fpga-0002:acl1:ch2
FPGALINK7=fpga-0002:acl0:ch3-fpga-0002:acl1:ch3
FPGALINK8=fpga-0003:acl0:ch0-fpga-0003:acl1:ch0
FPGALINK9=fpga-0003:acl0:ch1-fpga-0003:acl1:ch1
FPGALINK10=fpga-0003:acl0:ch2-fpga-0003:acl1:ch2
FPGALINK11=fpga-0003:acl0:ch3-fpga-0003:acl1:ch3
Topology configuration request accepted after 0.297791957855s

Clique topology

Within a pair of 2 nodes, each of the 4 FPGAs is connected to all 3 other FPGAs. Channel 0: to the same FPGA in the other node; channel 1: to the other FPGA in the same node; channels 2 and 3: to the other FPGA in the other node.

srun -p fpga -A pc2-mitarbeiter --constraint=19.2.0_max -N 2 --fpgalink=clique --pty bash
...
Summarizing most recent topology information and exporting FPGALINK variables:
Host list
fpga-0013
fpga-0014
Clique topology
Generated connections
FPGALINK0=fpga-0013:acl0:ch0-fpga-0014:acl0:ch0
FPGALINK1=fpga-0013:acl1:ch0-fpga-0014:acl1:ch0
FPGALINK2=fpga-0013:acl0:ch1-fpga-0013:acl1:ch1
FPGALINK3=fpga-0014:acl0:ch1-fpga-0014:acl1:ch1
FPGALINK4=fpga-0013:acl0:ch2-fpga-0014:acl1:ch2
FPGALINK5=fpga-0013:acl1:ch2-fpga-0014:acl0:ch2
FPGALINK6=fpga-0013:acl0:ch3-fpga-0014:acl1:ch3
FPGALINK7=fpga-0013:acl1:ch3-fpga-0014:acl0:ch3

Ring topology

This setup puts all FPGAs in a ring topology that defines for each FPGA the neighbor FPGAs "north" and "south". It connects each FPGA's channels 0 and 2 to the "north" direction and channels 1 and 3 to the "south" direction. Thus, the local perspective for each node within the topology is

// local view from FPGA "local" to neighbors "north" and "south"
// ch0 and ch2 connect to neighbor "north"
local:ch0 <-> north:ch1
local:ch2 <-> north:ch3
// ch1 and ch3 connect to neighbor "south"
local:ch1 <-> south:ch0
local:ch3 <-> south:ch2

Three different variants define how the FPGAs are arranged into the ring:

// --fpgalink="ringO"
// ringO, going down in acl0 column and back up in acl1 column
// Column from north to south, end connected back to start
fpga-0001:acl0
fpga-0002:acl0
fpga-0003:acl0
fpga-0004:acl0
fpga-0004:acl1
fpga-0003:acl1
fpga-0002:acl1
fpga-0001:acl1
// --fpgalink="ringN"
// ringN, going down in acl0 column then down in acl1 column
// Column from north to south, end connected back to start
fpga-0001:acl0
fpga-0002:acl0
fpga-0003:acl0
fpga-0004:acl0
fpga-0001:acl1
fpga-0002:acl1
fpga-0003:acl1
fpga-0004:acl1
// --fpgalink="ringZ"
// ringZ, going down through nodes, zigzaging between acl0 and acl1
// Column from north to south, end connected back to start
fpga-0001:acl0
fpga-0001:acl1
fpga-0002:acl0
fpga-0002:acl1
fpga-0003:acl0
fpga-0003:acl1
fpga-0004:acl0
fpga-0004:acl1

Full example for a ringO with 4 nodes

srun -p fpga -A pc2-mitarbeiter --constraint=19.2.0_max -N 4 --fpgalink=ringO --pty bash
Summarizing most recent topology information and exporting FPGALINK variables:
Host list
fpga-0009
fpga-0010
fpga-0011
fpga-0012
Ring topology information: column from north to south, end connected back to start
fpga-0009:acl0
fpga-0010:acl0
fpga-0011:acl0
fpga-0012:acl0
fpga-0012:acl1
fpga-0011:acl1
fpga-0010:acl1
fpga-0009:acl1
Generated connections
FPGALINK0=fpga-0009:acl0:ch1-fpga-0010:acl0:ch0
FPGALINK1=fpga-0009:acl0:ch3-fpga-0010:acl0:ch2
FPGALINK2=fpga-0010:acl0:ch1-fpga-0011:acl0:ch0
FPGALINK3=fpga-0010:acl0:ch3-fpga-0011:acl0:ch2
FPGALINK4=fpga-0011:acl0:ch1-fpga-0012:acl0:ch0
FPGALINK5=fpga-0011:acl0:ch3-fpga-0012:acl0:ch2
FPGALINK6=fpga-0012:acl0:ch1-fpga-0012:acl1:ch0
FPGALINK7=fpga-0012:acl0:ch3-fpga-0012:acl1:ch2
FPGALINK8=fpga-0012:acl1:ch1-fpga-0011:acl1:ch0
FPGALINK9=fpga-0012:acl1:ch3-fpga-0011:acl1:ch2
FPGALINK10=fpga-0011:acl1:ch1-fpga-0010:acl1:ch0
FPGALINK11=fpga-0011:acl1:ch3-fpga-0010:acl1:ch2
FPGALINK12=fpga-0010:acl1:ch1-fpga-0009:acl1:ch0
FPGALINK13=fpga-0010:acl1:ch3-fpga-0009:acl1:ch2
FPGALINK14=fpga-0009:acl1:ch1-fpga-0009:acl0:ch0
FPGALINK15=fpga-0009:acl1:ch3-fpga-0009:acl0:ch2

Torus topology

This setup puts all FPGAs in a torus topology that defines for each FPGA the neighbor FPGAs "north", "south", "west", "east". It connects each FPGA's channel 0 to the "north" direction, channel 1 to the "south" direction, channel 2 to the "west" direction and channel 3 to the "east" direction. Thus, the local perspective for each node within the topology is

// local view from FPGA "local" to neighbors "north", "south", "west", "east"
// ch0 connects to neighbor "north"
local:ch0 <-> north:ch1
// ch1 connects to neighbor "south"
local:ch1 <-> south:ch0
// ch2 connects to neighbor "west"
local:ch2 <-> west:ch3
// ch3 connects to neighbor "east"
local:ch3 <-> east:ch2

The torus topology can be instantiated with a configurable width, that is, the number of FPGAs connected in "west-east" direction. With an odd width, FPGAs in the same node can belong to consecutive rows of the torus. The number of FPGAs used is rounded down to the biggest full torus for the given width. The following block illustrates 3 different torus topologies on nodes fpga-[0001-0005]:

// --fpgalink="torus2"
// Torus with width 2 and height 5
// Columns from north to south, rows from west to east, end connected back to start
fpga-0001:acl0 - fpga-0001:acl1
fpga-0002:acl0 - fpga-0002:acl1
fpga-0003:acl0 - fpga-0003:acl1
fpga-0004:acl0 - fpga-0004:acl1
fpga-0005:acl0 - fpga-0005:acl1
 
// --fpgalink="torus3"
// Torus with width 3 and height 3
// Columns from north to south, rows from west to east, end connected back to start
fpga-0001:acl0 - fpga-0001:acl1 - fpga-0002:acl0
fpga-0002:acl1 - fpga-0003:acl0 - fpga-0003:acl1
fpga-0004:acl0 - fpga-0004:acl1 - fpga-0005:acl0
 
// --fpgalink="torus4"
// Torus with width 4 and height 2
// Columns from north to south, rows from west to east, end connected back to start
fpga-0001:acl0 - fpga-0001:acl1 - fpga-0002:acl0 - fpga-0002:acl1
fpga-0003:acl0 - fpga-0003:acl1 - fpga-0004:acl0 - fpga-0004:acl1

Full example for a torus4 with 8 nodes

srun -p fpga -A pc2-mitarbeiter --constraint=19.2.0_max -N 8 --fpgalink=torus4 --pty bash
...
Summarizing most recent topology information and exporting FPGALINK variables:
Host list
fpga-0001
fpga-0002
fpga-0003
fpga-0004
fpga-0005
fpga-0006
fpga-0007
fpga-0008
Torus topology with width 4 and height 4
Torus topology information: columns from north to south, rows from west to east, end connected back to start
fpga-0001:acl0 - fpga-0001:acl1 - fpga-0002:acl0 - fpga-0002:acl1
fpga-0003:acl0 - fpga-0003:acl1 - fpga-0004:acl0 - fpga-0004:acl1
fpga-0005:acl0 - fpga-0005:acl1 - fpga-0006:acl0 - fpga-0006:acl1
fpga-0007:acl0 - fpga-0007:acl1 - fpga-0008:acl0 - fpga-0008:acl1
Generated connections
FPGALINK0=fpga-0001:acl0:ch1-fpga-0003:acl0:ch0
FPGALINK1=fpga-0001:acl0:ch3-fpga-0001:acl1:ch2
FPGALINK2=fpga-0001:acl1:ch1-fpga-0003:acl1:ch0
FPGALINK3=fpga-0001:acl1:ch3-fpga-0002:acl0:ch2
FPGALINK4=fpga-0002:acl0:ch1-fpga-0004:acl0:ch0
FPGALINK5=fpga-0002:acl0:ch3-fpga-0002:acl1:ch2
FPGALINK6=fpga-0002:acl1:ch1-fpga-0004:acl1:ch0
FPGALINK7=fpga-0002:acl1:ch3-fpga-0001:acl0:ch2
FPGALINK8=fpga-0003:acl0:ch1-fpga-0005:acl0:ch0
FPGALINK9=fpga-0003:acl0:ch3-fpga-0003:acl1:ch2
FPGALINK10=fpga-0003:acl1:ch1-fpga-0005:acl1:ch0
FPGALINK11=fpga-0003:acl1:ch3-fpga-0004:acl0:ch2
FPGALINK12=fpga-0004:acl0:ch1-fpga-0006:acl0:ch0
FPGALINK13=fpga-0004:acl0:ch3-fpga-0004:acl1:ch2
FPGALINK14=fpga-0004:acl1:ch1-fpga-0006:acl1:ch0
FPGALINK15=fpga-0004:acl1:ch3-fpga-0003:acl0:ch2
FPGALINK16=fpga-0005:acl0:ch1-fpga-0007:acl0:ch0
FPGALINK17=fpga-0005:acl0:ch3-fpga-0005:acl1:ch2
FPGALINK18=fpga-0005:acl1:ch1-fpga-0007:acl1:ch0
FPGALINK19=fpga-0005:acl1:ch3-fpga-0006:acl0:ch2
FPGALINK20=fpga-0006:acl0:ch1-fpga-0008:acl0:ch0
FPGALINK21=fpga-0006:acl0:ch3-fpga-0006:acl1:ch2
FPGALINK22=fpga-0006:acl1:ch1-fpga-0008:acl1:ch0
FPGALINK23=fpga-0006:acl1:ch3-fpga-0005:acl0:ch2
FPGALINK24=fpga-0007:acl0:ch1-fpga-0001:acl0:ch0
FPGALINK25=fpga-0007:acl0:ch3-fpga-0007:acl1:ch2
FPGALINK26=fpga-0007:acl1:ch1-fpga-0001:acl1:ch0
FPGALINK27=fpga-0007:acl1:ch3-fpga-0008:acl0:ch2
FPGALINK28=fpga-0008:acl0:ch1-fpga-0002:acl0:ch0
FPGALINK29=fpga-0008:acl0:ch3-fpga-0008:acl1:ch2
FPGALINK30=fpga-0008:acl1:ch1-fpga-0002:acl1:ch0
FPGALINK31=fpga-0008:acl1:ch3-fpga-0007:acl0:ch2

Legacy topology setup

The following setup has been completely replaced by the user-configurable, job-specific topologies described above. It is kept here for reference regarding earlier measurements.

Some FPGA boards were connected with direct point-to-point connections that are abstracted in the OpenCL environment as serial channels. A status reg value of 0xfff11ff1 in the diagnose output indicates an active connection. The connections available at that time were documented as

<nodename>:<devicename>:<channelname> 
with 
<nodename>    in fpga-0001 -- fpga-0016
<devicename>  in acl0, acl1
<channelname> in ch0, ch1, ch2, ch3

Islands with 4 FPGAs

fpga-0010 + fpga-0011

// four FPGAs with all-to-all connections
// ch0 realizes vertical connections
fpga-0010:acl0:ch0 <-> fpga-0011:acl0:ch0
fpga-0010:acl1:ch0 <-> fpga-0011:acl1:ch0
// ch1 realizes horizontal connections
fpga-0010:acl0:ch1 <-> fpga-0010:acl1:ch1
fpga-0011:acl0:ch1 <-> fpga-0011:acl1:ch1
// ch2 realizes diagonal connections
fpga-0010:acl0:ch2 <-> fpga-0011:acl1:ch2
fpga-0010:acl1:ch2 <-> fpga-0011:acl0:ch2

fpga-0015 + fpga-0016

// four FPGAs with all-to-all connections
// ch0 realizes vertical connections
fpga-0015:acl0:ch0 <-> fpga-0016:acl0:ch0
fpga-0015:acl1:ch0 <-> fpga-0016:acl1:ch0
// ch1 realizes horizontal connections
fpga-0015:acl0:ch1 <-> fpga-0015:acl1:ch1
fpga-0016:acl0:ch1 <-> fpga-0016:acl1:ch1
// ch2 realizes diagonal connections
fpga-0015:acl0:ch2 <-> fpga-0016:acl1:ch2
fpga-0015:acl1:ch2 <-> fpga-0016:acl0:ch2

Nodes with internal connections of 2 FPGAs

fpga-0012

// four connections from one board to the other
fpga-0012:acl0:ch0 <-> fpga-0012:acl1:ch0
fpga-0012:acl0:ch1 <-> fpga-0012:acl1:ch1
fpga-0012:acl0:ch2 <-> fpga-0012:acl1:ch2
fpga-0012:acl0:ch3 <-> fpga-0012:acl1:ch3

fpga-0013

// four connections from one board to the other
fpga-0013:acl0:ch0 <-> fpga-0013:acl1:ch0
fpga-0013:acl0:ch1 <-> fpga-0013:acl1:ch1
fpga-0013:acl0:ch2 <-> fpga-0013:acl1:ch2
fpga-0013:acl0:ch3 <-> fpga-0013:acl1:ch3

fpga-0014

// two connections from one board to the other
fpga-0014:acl0:ch0 <-> fpga-0014:acl1:ch0
fpga-0014:acl0:ch3 <-> fpga-0014:acl1:ch3


A torus connecting 5 nodes and 10 FPGAs

14 August 2019: Torus updated from 4 to 5 nodes

The topology forms 2 columns that span all nodes and 5 rows that connect the FPGAs within each node.

fpga-0005:acl0 - fpga-0005:acl1
fpga-0006:acl0 - fpga-0006:acl1
fpga-0007:acl0 - fpga-0007:acl1
fpga-0008:acl0 - fpga-0008:acl1
fpga-0009:acl0 - fpga-0009:acl1
// e.g. 
// the "north" neighbor of fpga-0005:acl0 is fpga-0009:acl0 (wrap around)
// the "south" neighbor of fpga-0005:acl0 is fpga-0006:acl0
// the "west" neighbor of fpga-0005:acl0 is fpga-0005:acl1 (wrap around)
// the "east" neighbor of fpga-0005:acl0 is fpga-0005:acl1

The local view of connections as seen from within a local node is as follows.

// local view from FPGA "local" to neighbors "north", "south", "west", "east"
// ch0 connects to neighbor "north"
local:ch0 <-> north:ch1
// ch1 connects to neighbor "south"
local:ch1 <-> south:ch0
// ch2 connects to neighbor "west"
local:ch2 <-> west:ch3
// ch3 connects to neighbor "east"
local:ch3 <-> east:ch2

The complete set of connections is as follows

fpga-0005:acl0:ch1 <-> fpga-0006:acl0:ch0
fpga-0005:acl0:ch3 <-> fpga-0005:acl1:ch2
fpga-0005:acl1:ch1 <-> fpga-0006:acl1:ch0
fpga-0005:acl1:ch3 <-> fpga-0005:acl0:ch2
fpga-0006:acl0:ch1 <-> fpga-0007:acl0:ch0
fpga-0006:acl0:ch3 <-> fpga-0006:acl1:ch2
fpga-0006:acl1:ch1 <-> fpga-0007:acl1:ch0
fpga-0006:acl1:ch3 <-> fpga-0006:acl0:ch2
fpga-0007:acl0:ch1 <-> fpga-0008:acl0:ch0
fpga-0007:acl0:ch3 <-> fpga-0007:acl1:ch2
fpga-0007:acl1:ch1 <-> fpga-0008:acl1:ch0
fpga-0007:acl1:ch3 <-> fpga-0007:acl0:ch2
fpga-0008:acl0:ch1 <-> fpga-0009:acl0:ch0
fpga-0008:acl0:ch3 <-> fpga-0008:acl1:ch2
fpga-0008:acl1:ch1 <-> fpga-0009:acl1:ch0
fpga-0008:acl1:ch3 <-> fpga-0008:acl0:ch2
fpga-0009:acl0:ch1 <-> fpga-0005:acl0:ch0
fpga-0009:acl0:ch3 <-> fpga-0009:acl1:ch2
fpga-0009:acl1:ch1 <-> fpga-0005:acl1:ch0
fpga-0009:acl1:ch3 <-> fpga-0009:acl0:ch2

Troubleshooting

CL_INVALID_PROGRAM_EXECUTABLE with fast emulation

When using the fast emulator with host code that was previously tested with the legacy emulator and/or hardware execution, you may encounter a problem during execution that corresponds to the OpenCL error code CL_INVALID_PROGRAM_EXECUTABLE. To fix this issue, your host code needs to invoke clBuildProgram (C API) or program.build() (C++ API). This invocation is required by any conformant OpenCL code, but with legacy emulation and hardware execution it was not required and could be skipped.

FPGA programmed with bitstreams built with different SDK versions in the same session

Error message during bitstream programming from host code or with aocl program:

FAILED to read auto-discovery string at byte 2: Expected version is 19, found 20
Error: The currently programmed/flashed design is no longer supported in this release. Please recompile the design with the present version of the SDK and re-program/flash the board.
acl_hal_mmd.c:1460:assert failure: Failed to initialize kernel interfacemain: acl_hal_mmd.c:1460: l_try_device: Assertion `0' failed.

This or similar error messages come up when invoking host code or aocl commands after the FPGA was configured with a bitstream that was built with an earlier SDK version. Workaround (as sketched below):

  • Load the latest intelFPGA_pro module (e.g. 19.3.0)
  • Configure the target bitstream (e.g. built with the 19.2.0 SDK) using aocl program or your OpenCL host code
  • Optionally, reload the intelFPGA_pro module version that was used when building the bitstream
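
A sketch of this sequence (module versions, device name and bitstream path are illustrative):

module load intelFPGA_pro/19.3.0        # load the latest SDK first
aocl program acl0 bin/kernel_19_2.aocx  # configure the bitstream built with the older SDK
module load intelFPGA_pro/19.2.0        # optionally switch back to the SDK used for the build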

LOCALE settings forwarded from your computer

Error message in quartus_sh_compile.log:

Internal Error: Sub-system: CFG_INI, File: /quartus/ccl/cfg_ini/cfg_ini_reader.cpp, Line: 1530
Couldn't parse ini setting qspc_nldm_max_step_size=10.0 as a floating point value
Stack Trace:
     0xb4fe: err_report_internal_error(char const*, char const*, char const*, int) + 0x1a (ccl_err)
    0x17b45: cfg_get_double_value(std::string const&, double) + 0xe4 (ccl_cfg_ini)
    0x8f788: CFG_INI_DOUBLE::refresh() + 0x48 (tsm_qspc)
...
Error (23035): Tcl error: couldn't open "top.fit.rpt": no such file or directory
    while executing
"open $report"
    (procedure "fetch_pseudo_panel" line 3)
    invoked from within
"fetch_pseudo_panel $report "Found \[0-9\]* clocks" {1 0} 2"
    (procedure "fetch_clock_periods" line 6)
    invoked from within
"fetch_clock_periods $report"
    (procedure "fetch_clock" line 2)
    invoked from within
"fetch_clock "$revision_name.fit.rpt" $clkname"
    (procedure "get_fmax_from_report" line 8)
    invoked from within
"get_fmax_from_report $k_clk_name 1 $recovery_multicycle $iteration"
    (procedure "get_kernel_clks_and_fmax" line 5)
    invoked from within
"get_kernel_clks_and_fmax $k_clk_name $k_clk2x_name $recovery_multicycle $iteration"
    (file "/cm/shared/opt/intelFPGA_pro/19.4.0/hld/ip/board/bsp/adjust_plls.tcl" line 815)
    invoked from within
"source "$sdk_root/ip/board/bsp/adjust_plls.tcl""
    (file "scripts/post_flow_pr.tcl" line 59)
Error (23031): Evaluation of Tcl script scripts/post_flow_pr.tcl unsuccessful
...
Error: Quartus Fitter has failed! Breaking execution...
Error (23035): Tcl error: 
    while executing
"qexec "quartus_cdb -t scripts/post_flow_pr.tcl \"$top_path\"""
    invoked from within
"if {$revision_name eq "top"} {

  post_message "Compiling top revision..."

  # Load OpenCL BSP utility functions
  source "$sdk_root/ip/board/bsp/ope..."
    (file "compile_script.tcl" line 40)
Error (23031): Evaluation of Tcl script compile_script.tcl unsuccessful
Error: Quartus Prime Compiler Database Interface was unsuccessful. 3 errors, 0 warnings
    Error: Peak virtual memory: 1021 megabytes
    Error: Processing ended: Mon Mar 30 14:47:15 2020
    Error: Elapsed time: 03:06:43
    Error: System process ID: 21428

The root cause is outlined in the first message of the above excerpt from quartus_sh_compile.log: parsing a number as a floating point value failed. This can be caused by locale settings that are forwarded from the computer you use to connect to Noctua. After connecting to Noctua, check your locale settings with locale, and if necessary change them with export LC_NUMERIC="en_US.UTF-8":

[tester@fe-1 matrix_mult]$ locale
...
LC_NUMERIC="de_DE.UTF-8" // can cause above error
[tester@fe-1 matrix_mult]$ export LC_NUMERIC="en_US.UTF-8"
[tester@fe-1 matrix_mult]$ locale
...
LC_NUMERIC="en_US.UTF-8" // known to work
...