
4.3 Memory architecture optimization

4.3.1 Local memory implementation


It can be verified that during kernel execution most of the time is spent performing DRAM accesses. As discussed before, the ldpc_cnp_kernel and ldpc_vnp_kernel kernels have comparable durations, and for them too memory operations consume most of the execution time.

In order to have a reference metric, a simulation with the same input data set and parameters is run on a 3.2 GHz Intel i7-6900K processor with the AVX2 solution. The execution time measured for the AVX2 implementation is 257.549 µs, which is very low. Comparing the OpenCL execution with the AVX2 one, the acceleration factor of AVX2 (OpenCL execution time over AVX2 execution time) is 149699x, whilst the acceleration factor of the GPU is 358353x. Hence, the OpenCL solution is for the moment very far from the other two implementations in terms of performance; on the other hand, the acceleration of the GPU code with respect to the AVX2 one is 2.39x.
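For clarity, the 2.39x figure follows directly from the ratio of the two acceleration factors reported above, since both are referred to the same OpenCL execution time:

\[
\frac{t_{\mathrm{OpenCL}}/t_{\mathrm{GPU}}}{t_{\mathrm{OpenCL}}/t_{\mathrm{AVX2}}} = \frac{t_{\mathrm{AVX2}}}{t_{\mathrm{GPU}}} = \frac{358353}{149699} \approx 2.39
\]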

To summarize the results obtained so far, the four kernels do not have a uniform time occupation; in particular, one of the kernels takes a long time because of the initialization of the matrices used for intermediate computation. Additionally, most of the decoder time is spent in DRAM interactions. One of the kernels (pack_decoded_bit) can be ignored during the optimization flow since it already achieves the best performance compared with the other kernels.

The first optimization step is to obtain a uniform time occupation for the three critical kernels, especially the first one. Therefore DRAM accesses must be reduced and repeated operations must be eliminated in order to avoid limitations due to initialization tasks.


A key property of the decoder is that work groups do not interfere with each other. It is important to remember the work item and work group organization: the check node kernels have 46 work groups with 384 work items each. As described in chapter 3, LDPC codes in OAI are quasi-cyclic, hence each entry of the base graph is replaced by a shifted identity matrix of size ZC×ZC. The number of rows of the base graph is multiplied by ZC, thus in the OpenCL implementation each work group has ZC work items (384 in this case) to process the shifted identity matrix corresponding to an entry of the base graph. Therefore no work group accesses the variables used by another one, since each one has its own row to work on.
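As an illustration of this lifting (a minimal sketch with a hypothetical helper name, not code from the OAI decoder), in a ZC×ZC identity matrix circularly shifted by p, row r has its single nonzero entry in column (r + p) mod ZC, which is why each work item can process one row independently:

/* Hypothetical helper, only for illustration: column index of the single
   nonzero entry in row r of a Zc x Zc identity matrix shifted by p. */
int shifted_identity_col(int r, int p, int Zc)
{
    return (r + p) % Zc;
}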

On the other hand, the bit node processing kernel has 68 work groups with ZC = 384 work items each to elaborate the parity check matrix columns. Also in this case the groups do not share variables between them.
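The work item organization described above can be summarized with the NDRange sizes below (a sketch for the given parameters; the array names are placeholders, not the ones used in the host code):

#include <stddef.h>

/* Illustrative NDRange sizes for Zc = 384. */
size_t cnp_global[3] = {46 * 384, 1, 1};   /* check node kernels: 46 work groups, one per base graph row    */
size_t vnp_global[3] = {68 * 384, 1, 1};   /* bit node kernel: 68 work groups, one per base graph column    */
size_t local_size[3] = {384, 1, 1};        /* ZC work items per work group                                  */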

Thus local memory can be used to store the results that check nodes and bit nodes exchange, improving the data access time.

Another limitation of local memory is that it is not shared between kernels: each kernel has its own dedicated local memories, therefore some modifications in the host code and in the kernel code must be applied. For this reason, starting from the current solution, the ldpc_cnp_kernel_1st_iter, ldpc_cnp_kernel and ldpc_vnp_kernel_normal kernels are transformed into functions and merged into a single kernel called nrLDPC_decoder. The pack_decoded_bit is kept as a standalone kernel.

From the host code side, the loop that launches the kernels, shown in 3.21, has been changed since now only two kernels must be executed. The check node and bit node kernels are now merged into a single one, hence the function to execute is selected according to a support variable, namely index, passed by the host code. Thus the host loop now has the following structure:

for (int ii = 0; ii < numMaxIter; ii++) {
    if (ii == 0) {
        index = 1;
        arg = 0;
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &index));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(cl_mem), &dev_const_llr));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(cl_mem), &dev_llr));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &BG));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &row));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &col));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &Zc));
        OCL_CHECK(clEnqueueNDRangeKernel(commands, nrLDPC_decoder, 3, NULL, dimGridKernel2, dimBlockKernel2, 0, NULL, 0));
        OCL_CHECK(clFinish(commands));
    }
    else {
        index = 2;
        arg = 0;
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &index));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(cl_mem), &dev_const_llr));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(cl_mem), &dev_llr));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &BG));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &row));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &col));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &Zc));
        OCL_CHECK(clEnqueueNDRangeKernel(commands, nrLDPC_decoder, 3, NULL, dimGridKernel2, dimBlockKernel2, 0, NULL, 0));
        OCL_CHECK(clFinish(commands));
    }
    if (ii + 1 != numMaxIter) {
        index = 3;
        arg = 0;
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &index));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(cl_mem), &dev_const_llr));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(cl_mem), &dev_llr));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &BG));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &row));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &col));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &Zc));
        OCL_CHECK(clEnqueueNDRangeKernel(commands, nrLDPC_decoder, 3, NULL, dimGridKernel2, dimBlockKernel2, 0, NULL, 0));
        OCL_CHECK(clFinish(commands));
    }
    else {
        index = 4;
        arg = 0;
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &index));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(cl_mem), &dev_const_llr));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(cl_mem), &dev_llr));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &BG));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &row));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &col));
        OCL_CHECK(clSetKernelArg(nrLDPC_decoder, arg++, sizeof(int), &Zc));
        OCL_CHECK(clEnqueueNDRangeKernel(commands, nrLDPC_decoder, 3, NULL, dimGridKernel2, dimBlockKernel2, 0, NULL, 0));
        OCL_CHECK(clFinish(commands));
    }
}
size_t pack = (block_length / 128) + 1;
size_t pack_block[3] = {(pack - 1) * 128, MC, 1};
size_t pack_local[3] = {128, 1, 1};
arg = 0;
OCL_CHECK(clSetKernelArg(pack_decoded_bit, arg++, sizeof(cl_mem), &dev_llr));
OCL_CHECK(clSetKernelArg(pack_decoded_bit, arg++, sizeof(cl_mem), &dev_tmp));
OCL_CHECK(clSetKernelArg(pack_decoded_bit, arg++, sizeof(int), &col));
OCL_CHECK(clSetKernelArg(pack_decoded_bit, arg++, sizeof(int), &Zc));
OCL_CHECK(clEnqueueNDRangeKernel(commands, pack_decoded_bit, 3, NULL, pack_block, pack_local, 0, NULL, 0));
OCL_CHECK(clFinish(commands));

Listing 4.2. New host code loop after the kernel merge; the index variable is used to execute the proper function in the kernel code.

The new kernel, nrLDPC_decoder, receives the same decoder parameters as the kernels of the previous implementation.

The memory objects dev_h_compact1, dev_h_compact2 and dev_dt are no longer set as kernel arguments, since they are now implemented as local memories on the kernel code side. The additional parameter is the index variable, which tells the kernel code which function must be invoked.
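As a minimal sketch of the kind of change involved (the kernel name, buffer name and size are placeholders, not the actual OAI declarations), a buffer that was previously a global cl_mem object becomes a __local array declared inside the kernel, so it lives in on-chip memory private to each work group:

__kernel void local_buffer_example(__global const char *in, __global char *out)
{
    __local char dt[384];                 /* on-chip scratch, one copy per work group */
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    dt[lid] = in[gid];                    /* stage the data in local memory           */
    barrier(CLK_LOCAL_MEM_FENCE);         /* make it visible to the whole work group  */
    out[gid] = dt[(lid + 1) % 384];       /* work items exchange results locally      */
}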

When the variable index is set to 1, the ldpc_cnp_kernel_1st_iter function is executed. If it is 2, ldpc_cnp_kernel is called. The values 3 and 4 both invoke the ldpc_vnp_kernel_normal function; two distinct values are used to specify when to write by bursts.
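This dispatch can be sketched as follows (an illustrative skeleton, not the actual OAI kernel: the stub functions stand in for the merged functions listed above and their arguments are omitted):

void cnp_first_iter(void) { /* former ldpc_cnp_kernel_1st_iter body */ }
void cnp(void)            { /* former ldpc_cnp_kernel body          */ }
void vnp(int burst_write) { /* former ldpc_vnp_kernel_normal body   */ }

__kernel void nrLDPC_decoder_sketch(const int index)
{
    if (index == 1)
        cnp_first_iter();     /* first iteration, check node processing            */
    else if (index == 2)
        cnp();                /* following iterations, check node processing       */
    else
        vnp(index == 4);      /* bit node processing; index 4 enables burst writes */
}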

In the previous solution, the check node and bit node kernels had different global sizes but the same local size. Because of the merge, the work group size of the new kernel is still equal to ZC (384) and the global size is equal to the maximum global size of the previous kernels, which corresponds to the number of columns multiplied by ZC. For the given parameters the total number of work items is 26112, which corresponds to the number of bits of the codeword. Since in the previous solution the ldpc_cnp_kernel_1st_iter and ldpc_cnp_kernel kernels have fewer work items than ldpc_vnp_kernel_normal, an if statement in the kernel code is required to specify how many work items must be executed when those functions are called. The pack_decoded_bit kernel is not modified in this solution.
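A minimal sketch of such a guard, assuming the check node functions only need the first 46 groups of ZC work items (the kernel and variable names are illustrative):

__kernel void guard_example(const int index, const int Zc)
{
    /* When a check node function is selected (index 1 or 2), only the first
       46 * Zc work items do useful work and the rest return immediately;
       all 68 * Zc launched work items run for the bit node pass. */
    if ((index == 1 || index == 2) && get_global_id(0) >= (size_t)(46 * Zc))
        return;

    /* ... call the selected function here ... */
}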