Chapter 4

Acceleration of LDPC application on Xilinx FPGA

4.1 SDAccel environment

the other one waits its turn to use the memory. In other words, the host code first sets up the execution environment for the FPGA and loads the input data into the DRAM memory; during this interval the FPGA cannot use the memory. Then the host launches the kernels on the FPGA and loses control of the memory, which belongs to the FPGA until the computation has completed. Once the FPGA kernels return, control is given back to the host code, which can read the results from DRAM.
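
As an illustration of this flow, a minimal host-side sketch using the standard OpenCL C API is given below. The function name, kernel arguments and buffer sizes are placeholders, and a single work-item launch through clEnqueueTask is assumed; the actual host code of the application may differ.

    #include <CL/cl.h>

    /* Sketch of the host flow: write inputs to DRAM, launch the kernel,
       wait for completion, read the results back. */
    void run_once(cl_context ctx, cl_command_queue q, cl_kernel krnl,
                  const float *host_in, float *host_out,
                  size_t in_bytes, size_t out_bytes)
    {
        cl_int err;
        cl_mem in_buf  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  in_bytes,  NULL, &err);
        cl_mem out_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, out_bytes, NULL, &err);

        /* 1. The host owns the DRAM: load the input data into device memory. */
        clEnqueueWriteBuffer(q, in_buf, CL_TRUE, 0, in_bytes, host_in, 0, NULL, NULL);

        /* 2. Launch the kernel; the FPGA owns the memory until it returns. */
        clSetKernelArg(krnl, 0, sizeof(cl_mem), &in_buf);
        clSetKernelArg(krnl, 1, sizeof(cl_mem), &out_buf);
        clEnqueueTask(q, krnl, 0, NULL, NULL);
        clFinish(q);

        /* 3. Control of the DRAM returns to the host, which reads the results. */
        clEnqueueReadBuffer(q, out_buf, CL_TRUE, 0, out_bytes, host_out, 0, NULL, NULL);

        clReleaseMemObject(in_buf);
        clReleaseMemObject(out_buf);
    }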

The FPGA runs the kernel code, which is the compute-intensive part of the application that must be accelerated. In terms of terminology, a kernel is a function executed on the FPGA; several kernels can be executed on a single FPGA, concurrently or not. To improve parallelism, several copies of the same kernel can be executed in parallel. A single execution of a kernel is defined as a work item, and a set of work items is named a work group. The size of a work group is decided by the host code. A work item is identified by the id of its group (group_id) and by its id within the group (local_id); to identify one work item among all the work items of the application, the global_id must be used. Kernel executions are put into a command queue, which can execute the kernels in order or out of order; the choice is made by the host code. Several queues can be used to improve parallelism, but synchronization between them is required if kernels share variables.
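
The following kernel sketch shows how these identifiers are queried inside the kernel code; the kernel name, its arguments and the one-dimensional indexing are only illustrative.

    /* Each work item scales one element; the ids identify which one. */
    __kernel void vec_scale(__global const float *in, __global float *out, float k)
    {
        size_t gid   = get_global_id(0);   /* id among all work items          */
        size_t lid   = get_local_id(0);    /* id inside the current work group */
        size_t group = get_group_id(0);    /* id of the work group             */
        size_t lsize = get_local_size(0);  /* work items per work group        */

        /* With a zero global offset, gid equals group * lsize + lid. */
        out[gid] = k * in[group * lsize + lid];
    }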

The command queues can be controlled by the host code with proper callback functions, but the synchronization between work items and work groups is not under the user's control and can happen in any order. In OpenCL, work items and work groups are described in a three-dimensional index space called NDRange, whose sizes are the global size (related to work items) and the local size (related to work groups). This 3D space is used to execute the kernels according to the host code specifications. Hence, if the number of work items specified for the kernel is greater than one, the kernel code is inserted into a loop which is related to the NDRange.

This outermost loop will iterate through the work items of a group.
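
Conceptually (this is a sketch, not actual tool output), the code generated for one work group can be pictured as the kernel body nested inside three loops over the local ids:

    /* Conceptual view of the work-item loop added around the kernel body
       when the work-group size is larger than one. */
    for (size_t z = 0; z < local_size_2; z++)
        for (size_t y = 0; y < local_size_1; y++)
            for (size_t x = 0; x < local_size_0; x++) {
                /* kernel body executed once for the work item (x, y, z) */
            }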

The work item and work group organization is important since it is related to the memory hierarchy used by OpenCL. According to the API reference guide by the Khronos group [24], four memory address spaces are available on the FPGA (a kernel sketch using the corresponding qualifiers follows the list):

1. Global, which is shared by all work items and work groups. Read and write memory.

2. Local, which is shared by the work items of a work group. Read and write memory.

3. Private, which is accessible by one work item only. Read and write memory.

4. Constant, which is shared by all work items and work groups. Read only memory.
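
The sketch below shows the corresponding OpenCL qualifiers in a kernel; the kernel name, array sizes and the computation are placeholders.

    __kernel void address_spaces(__global float *io,      /* 1. global: read/write, all work items  */
                                 __constant float *lut)   /* 4. constant: read only, all work items */
    {
        __local float shared[64];   /* 2. local: shared by the work items of one group */
        float priv;                 /* 3. private: visible to a single work item       */

        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        shared[lid] = io[gid];
        barrier(CLK_LOCAL_MEM_FENCE);
        priv = shared[lid] + lut[0];
        io[gid] = priv;
    }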

Regarding the off-chip memory, the DRAM has both a read-write address space and a read-only one. The off-chip memory has the largest storage but also the largest latency. The global (on-chip) memory is implemented as SRAM in small blocks (BRAM), and the local memory is implemented as BRAM as well. Local memory is smaller than the on-chip global memory, thus its access time is much smaller; the drawback is that it is accessible only within a work group. The fastest memory of all is the private one, but it is also the one with the smallest storage.
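
A common way to exploit this hierarchy is to burst a block of data from the slow DRAM into the fast local BRAM before computing on it. The sketch below assumes a work-group size equal to the block size; the block size, kernel name and computation are placeholders.

    #define BLK 512

    __kernel void cache_block(__global const float *src, __global float *dst)
    {
        __local float buf[BLK];

        /* Burst-copy one block from off-chip DRAM into on-chip local memory. */
        event_t e = async_work_group_copy(buf, src + get_group_id(0) * BLK, BLK, 0);
        wait_group_events(1, &e);

        /* Subsequent accesses hit the local BRAM instead of the DRAM. */
        size_t lid = get_local_id(0);   /* assumes local size == BLK */
        dst[get_group_id(0) * BLK + lid] = 2.0f * buf[lid];
    }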

From the design point of view, SDAccel offers three build targets: software emulation, hardware emulation and system. The designer uses software emulation to check the syntax of the kernel code and verify its functionality; it is the fastest build and it is executed on the x86 machine.

Hardware emulation uses the RTL generated by Vivado HLS to verify that the generated hardware works as expected. This emulation runs on a dedicated environment which is very slow and time consuming, thus a small data set is suggested in order to run and complete the emulation.

Finally, the system target generates the bitstream that will be deployed on the FPGA, using the RTL code from Vivado HLS. This target runs directly on the FPGA and provides the actual results of the generated hardware.
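
As an illustration, the build target is typically selected with the -t option of the xocc compiler, as in the sketch below; the platform name and file names are placeholders and the exact flags depend on the SDAccel version.

    # compile and link the kernel for software emulation
    # (hw_emu and hw are selected in the same way)
    xocc -t sw_emu --platform <platform> -c -k krnl_ldpc -o krnl.sw_emu.xo krnl_ldpc.cl
    xocc -t sw_emu --platform <platform> -l -o krnl.sw_emu.xclbin krnl.sw_emu.xo

    # the emulation targets also need an emulation configuration file and the
    # matching environment variable when the host code is run
    emconfigutil --platform <platform>
    XCL_EMULATION_MODE=sw_emu ./host krnl.sw_emu.xclbin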

The design flow adopted for this work is the following: software emulation is used to verify the functionality of the code after code modifications. The hardware emulation of the target shows no data transfer between the host and the device because of a corrupted platform file used to communicate with the FPGA; hence this emulation step is skipped. After software emulation, the system target is generated and the bitstream is deployed on the target device xcvu9p-flgb2104-2-i, belonging to the Xilinx Virtex UltraScale+ family. The FPGA has been provided by AWS, which is a cloud computing service by Amazon.

Regarding the profiling and optimization part of the design, SDAccel provides different tools to find the weak points of the application; suggestions for possible improvements are also given. The tools used in this work are the timeline trace, the HLS reports, the system estimate report and the profile summary. These reports are not always generated, since the emulation builds do not provide SDAccel with all the information needed to generate the reports completely [25].

The timeline trace is a graphical report showing the data transfers, kernel executions and OpenCL API calls on a timeline axis. This report is generated automatically for the hardware emulation. For hardware builds the timeline trace generation must be enabled at compile time. If the proper options are given to the compiler, the stalls, the execution times and the data transfers between kernel and DRAM are shown.
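
As an example of how this is usually enabled, the runtime options can be placed in an sdaccel.ini file next to the host executable, while stall information requires instrumenting the kernel at compile time; the option names below are indicative and depend on the SDAccel version.

    # sdaccel.ini, placed in the directory of the host executable
    [Debug]
    profile=true
    timeline_trace=true
    data_transfer_trace=fine
    stall_trace=all

    # compile-time instrumentation of the kernel for the hardware build
    xocc -t hw --platform <platform> --profile_kernel data:all:all:all --profile_kernel stall:all:all -l -o krnl.hw.xclbin krnl.hw.xo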

The HLS report is generated by Vivado HLS; it is an estimation of the synthesis, therefore the reported figures are not the actual results. It contains information about the resource usage, the latency of the loops in the design, the estimated clock period and details regarding the logic implementation.

The system estimate report is generated automatically for hardware emulation and system builds; it provides high-level details of the kernels.

The profile summary contains comments and accurate details on the kernels. It is divided into sections and provides details regarding the API calls and the data transfers between the host and the FPGA. A section with comments regarding possible improvements is also shown. It is generated for all the builds after kernel execution.

For software emulation the data transfer information is omitted.

In the design flow used in this thesis, the software emulation is used for functionality verification, whilst the hardware reports are used to drive the optimizations of the next version of the code.