LDPC decoder for GPUs - LDPC in OpenAirInterface

LDPC in OpenAirInterface

3.4 LDPC decoder for GPUs


spent in trying the new code and then debugging it. Therefore, the code described up to now has been abandoned and the focus has moved to the GPU version of the LDPC decoder.


this solution is that the base graph must be scanned twice, first to fill h_compact1 and then to fill h_compact2. The value stored in these matrices is not binary but a natural number; therefore, in each iteration of the decoder a modulo operation is performed to retrieve the circular shift.

The data type of h_compact1 and h_compact2 is a data structure called h_element, which contains the x and y coordinates in the matrix (char type) and the value stored in that entry (short type), which is used to derive the circular shift coefficient for the identity matrix.
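The following is a minimal sketch of the data structure described above and of the modulo step that recovers the circular shift. The field order, the mapping of x to the row index and y to the column index, and the helper names circular_shift and connected_bit are assumptions made for illustration; they are not taken from the original source code.

```cpp
#include <cassert>

// Assumed layout following the description in the text: two char
// coordinates and a short holding the raw base graph entry.
struct h_element {
    char x;       // assumed: row index in the base graph
    char y;       // assumed: column index in the base graph
    short value;  // raw entry from which the circular shift is derived
};

// Circular shift applied to the Zc x Zc identity block.
// The stored value is a natural number, so it is reduced modulo Zc.
inline int circular_shift(const h_element& e, int Zc) {
    assert(Zc > 0 && e.value >= 0);
    return e.value % Zc;
}

// Hypothetical helper: index of the bit that lifted row r (0 <= r < Zc)
// of this block is connected to, inside block column e.y.
inline int connected_bit(const h_element& e, int Zc, int r) {
    assert(r >= 0 && r < Zc);
    return e.y * Zc + (r + circular_shift(e, Zc)) % Zc;
}
```

For instance, a stored value of 500 with ZC = 384 gives a shift of 116, which is why the modulo has to be evaluated again for every lifting factor.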

The first matrix is used for check node processing. It is a one-dimensional array with 46·19 elements: 46 is the largest number of base graph rows (that of BG1), and each row is given 19 elements. Compressing the matrix in this way is intended to speed up the processing. Because a row may use fewer than 19 elements, the constant vectors called h_ele_row_bg_count, with 46 or 42 elements depending on the base graph, report the exact number of columns in each row.

Similarly, variable node (or bit node) processing uses the h_compact2 vector, which has 68·30 elements. The number of rows in each column is reported in the constant vector h_ele_col_bg_count, which has 52 or 68 elements depending on the base graph.
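The two-pass compaction described above can be sketched as follows. This is an illustration under stated assumptions, not the original code: the dense input format (negative values meaning "no edge"), the slot ordering inside each table, and the names Entry, Compact and compact_base_graph are all hypothetical.

```cpp
#include <cassert>
#include <vector>

struct Entry { char x; char y; short value; };  // same fields as h_element

// Hypothetical container for the two compressed tables and their counters.
struct Compact {
    std::vector<Entry> by_row;   // n_row * max_row_w slots (role of h_compact1)
    std::vector<Entry> by_col;   // n_col * max_col_w slots (role of h_compact2)
    std::vector<int> row_count;  // used slots per row
    std::vector<int> col_count;  // used slots per column
};

// Compress a dense base graph; entries < 0 are treated as "no edge" (assumption).
Compact compact_base_graph(const std::vector<std::vector<int>>& bg,
                           int max_row_w, int max_col_w) {
    const int n_row = static_cast<int>(bg.size());
    const int n_col = static_cast<int>(bg[0].size());
    const Entry empty{0, 0, -1};
    Compact c{std::vector<Entry>(n_row * max_row_w, empty),
              std::vector<Entry>(n_col * max_col_w, empty),
              std::vector<int>(n_row, 0), std::vector<int>(n_col, 0)};

    // First scan, row by row: table used for check node processing.
    for (int i = 0; i < n_row; ++i)
        for (int j = 0; j < n_col; ++j)
            if (bg[i][j] >= 0) {
                assert(c.row_count[i] < max_row_w);
                c.by_row[i * max_row_w + c.row_count[i]++] =
                    Entry{static_cast<char>(i), static_cast<char>(j),
                          static_cast<short>(bg[i][j])};
            }

    // Second scan, column by column: table used for bit node processing.
    for (int j = 0; j < n_col; ++j)
        for (int i = 0; i < n_row; ++i)
            if (bg[i][j] >= 0) {
                assert(c.col_count[j] < max_col_w);
                c.by_col[j * max_col_w + c.col_count[j]++] =
                    Entry{static_cast<char>(i), static_cast<char>(j),
                          static_cast<short>(bg[i][j])};
            }
    return c;
}
```

With BG1 the call would use max_row_w = 19 and max_col_w = 30, which yields the 46·19 and 68·30 element tables mentioned in the text; row_count and col_count play the role of the h_ele_row_bg_count and h_ele_col_bg_count vectors.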

After the matrices for processing have been initialized, the host part of the code allocates the buffers and copies the content of h_compact1 and h_compact2 into the corresponding vectors in device memory, dev_h_compact1 and dev_h_compact2. The content of channel_output_fixed from the testbench is also copied into two different device buffers: dev_const_llr and dev_llr. The first is constant: it stores the channel samples and is never modified. The second is a temporary buffer storing the intermediate LLR values; it is filled with the channel samples for the first iteration and is then updated by the bit node processing kernel. The size of both dev_llr and dev_const_llr is ZC·n_col·sizeof(char).

dev_h_compact1 and dev_h_compact2 have the same size as h_compact1 and h_compact2, respectively.

dev_dt is the last buffer allocated. Its size, n_row·n_col·ZC·sizeof(char), provides ZC entries for every position of the base graph. It is used for data transfer between the check node and bit node kernels.
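As a worked example of the buffer sizes given above, the sketch below evaluates the two formulas for a given base graph and lifting factor. The struct and function names are hypothetical; the base graph dimensions (46 x 68 for BG1, 42 x 52 for BG2) follow the element counts quoted in the text.

```cpp
#include <cassert>
#include <cstddef>

struct BufferSizes {
    std::size_t llr;  // bytes for dev_llr (and, equally, dev_const_llr)
    std::size_t dt;   // bytes for dev_dt
};

// Hypothetical helper: evaluate the size formulas from the text.
inline BufferSizes buffer_sizes(int bg, int Zc) {
    assert((bg == 1 || bg == 2) && Zc > 0);
    const std::size_t n_row = (bg == 1) ? 46 : 42;
    const std::size_t n_col = (bg == 1) ? 68 : 52;
    const std::size_t zc = static_cast<std::size_t>(Zc);
    return BufferSizes{zc * n_col * sizeof(char),
                       n_row * n_col * zc * sizeof(char)};
}
```

For BG1 with ZC = 384, this gives 26112 bytes for each LLR buffer and 1201152 bytes for dev_dt.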

Finally, after buffer allocation the data are copied from the CPU side to the GPU side, then the host body loop is executed¹⁰:

    for (int ii = 0; ii < MAX_ITERATION; ii++) {
        if (ii == 0) {  // first iteration: check node kernel fed by the channel samples
            ldpc_cnp_kernel_1st_iter
                <<<dimGridKernel1, dimBlockKernel1>>>
                (dev_llr, dev_dt, BG, row, col, Zc);
        } else {        // later iterations: regular check node kernel
            ldpc_cnp_kernel
                <<<dimGridKernel1, dimBlockKernel1>>>
                (dev_llr, dev_dt, BG, row, col, Zc);
        }
        // bit node kernel, launched in every iteration
        ldpc_vnp_kernel_normal
            <<<dimGridKernel2, dimBlockKernel2>>>
            (dev_llr, dev_dt, dev_const_llr, BG, row, col, Zc);
    }
    // packing kernel, launched once after the last iteration
    int pack = (block_length / 128) + 1;
    dim3 pack_block(pack, MC, 1);
    pack_decoded_bit<<<pack_block, 128>>>(dev_llr, dev_tmp, col, Zc);

Listing 3.21. GPU host body loop which launches the kernels on the device

¹⁰ https://gitlab.eurecom.fr/oai/openairinterface5g/-/tree/develop/openair1/PHY/CODING/nrLDPC_decoder_LYC/nrLDPC_decoder_LYC.cu

The host code launches four kernels on the GPU in total. In each iteration, the for loop reported in Listing 3.21 launches two kernels: first one of the two check node kernels, then the bit node kernel. The choice of check node kernel depends on the iteration counter, since in the first iteration the decoder uses the channel samples for check node processing. Every time the host code launches a kernel, it has to set the kernel arguments and specify the dimensions of the execution units. In the CUDA environment, kernels are executed by threads. A set of threads forms a block.

Multiple blocks are grouped in a grid. Several blocks can run simultaneously, which increases parallelism and speeds up kernel execution. In this case, the block dimension (i.e. the number of threads per block) is equal to the lifting factor passed as a parameter to the decoder. The grid size (the number of blocks per grid) depends on the kernel: it equals the number of base graph rows for the check node kernels and the number of base graph columns for the bit node kernel.

When the maximum number of iterations is reached, the last kernel is executed: the packing kernel (corresponding to llr2bitPacked in the AVX2 code). It has 128 threads per block and (block_length/128)+1 blocks. After the conversion to bits, the content of the dev_llr buffer is copied from device memory to host memory.
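The launch dimensions described above can be summarised in a short host-side sketch. It is written in plain C++ so that it compiles without the CUDA toolkit: Dim3 is a hypothetical stand-in for CUDA's dim3, the function names are invented for illustration, and the second grid dimension of the packing kernel (MC in the listing) is simply left at 1 here.

```cpp
#include <cassert>

struct Dim3 { unsigned x, y, z; };  // hypothetical stand-in for CUDA's dim3

struct LaunchConfig {
    Dim3 grid;   // number of blocks per grid
    Dim3 block;  // number of threads per block
};

// Check node kernels: one block per base graph row, Zc threads per block.
inline LaunchConfig cnp_launch(int n_row, int Zc) {
    assert(n_row > 0 && Zc > 0);
    return LaunchConfig{{static_cast<unsigned>(n_row), 1u, 1u},
                        {static_cast<unsigned>(Zc), 1u, 1u}};
}

// Bit node kernel: one block per base graph column, Zc threads per block.
inline LaunchConfig vnp_launch(int n_col, int Zc) {
    assert(n_col > 0 && Zc > 0);
    return LaunchConfig{{static_cast<unsigned>(n_col), 1u, 1u},
                        {static_cast<unsigned>(Zc), 1u, 1u}};
}

// Packing kernel: (block_length / 128) + 1 blocks of 128 threads.
inline LaunchConfig pack_launch(int block_length) {
    assert(block_length > 0);
    return LaunchConfig{{static_cast<unsigned>(block_length / 128 + 1), 1u, 1u},
                        {128u, 1u, 1u}};
}
```

For BG1 with ZC = 384 and a block length of 8448, this yields 46 blocks of 384 threads for the check node kernels, 68 blocks of 384 threads for the bit node kernel, and 67 blocks of 128 threads for the packing kernel. The largest lifting factor, 384, stays below CUDA's limit of 1024 threads per block.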

As for the kernels themselves, the check node kernels are named ldpc_cnp_kernel and ldpc_cnp_kernel_1st_iter, the bit node kernel is ldpc_vnp_kernel_normal, and the packing kernel is called pack_decoded_bit.

The ldpc_cnp_kernel_1st_iter kernel does not read the dev_dt buffer during its computations; it uses the channel samples to evaluate whether the check nodes are satisfied given the current LLRs. The results are stored in dev_dt. The algorithm is the same as in the AVX2 code, but it is split into two separate loops: the first evaluates the beliefs, while the second stores the results in the global memory buffer dev_dt. The ldpc_cnp_kernel kernel is identical, except that in the first part the dev_dt buffer is read and used for the computation.

The bit node kernel reads the beliefs of the check nodes from the dev_dt buffer and computes the new LLR estimate (intrinsic and extrinsic); the result is written to the dev_llr buffer. In the next iteration, ldpc_cnp_kernel reads the new beliefs from the bit nodes and removes the intrinsic information from the value read in dev_llr.
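The data flow between the channel samples, the check node messages and the updated LLRs can be illustrated for a single bit as follows. This is only a sketch of the general message-passing idea: the saturation to the signed 8-bit range, the use of std::vector for the incoming messages and the function names are assumptions, and the actual kernels operate on the device buffers in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Clamp an int to the signed 8-bit range (assumed for the char buffers).
inline std::int8_t saturate(int v) {
    return static_cast<std::int8_t>(std::min(127, std::max(-128, v)));
}

// Bit node update: channel LLR (role of dev_const_llr) plus every
// incoming check node message (role of dev_dt). The result would be
// written to dev_llr.
inline std::int8_t bit_node_update(std::int8_t channel_llr,
                                   const std::vector<std::int8_t>& check_msgs) {
    assert(!check_msgs.empty());  // every bit takes part in at least one check
    int total = channel_llr;
    for (std::int8_t m : check_msgs) total += m;
    return saturate(total);
}

// Check node input in the next iteration: the total LLR read from dev_llr
// minus the message this same check node sent in the previous iteration.
inline std::int8_t check_node_input(std::int8_t total_llr,
                                    std::int8_t own_prev_msg) {
    return saturate(static_cast<int>(total_llr) - own_prev_msg);
}
```

With a channel LLR of 10 and check messages 3, -5 and 7, the updated LLR is 15; the check node that sent 3 then works with 15 - 3 = 12 in the following iteration.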

The pack_decoded_bit kernel has a single for loop that iterates 8 times. The kernel first performs the hard decision on the final LLRs; then, after thread synchronization, the loop stores the results in the global memory buffer dev_tmp, whose size is ZC·n_col·sizeof(char).
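One plausible reading of the 8-iteration loop is that eight hard decisions are packed into one byte. The sequential sketch below illustrates that idea only; the sign convention (negative LLR maps to bit 1), the most-significant-bit-first ordering and the function names are assumptions, and the real kernel distributes this work over GPU threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hard decision on one LLR. Assumed convention: negative LLR -> bit 1.
inline std::uint8_t hard_decision(std::int8_t llr) {
    return llr < 0 ? 1 : 0;
}

// Pack the first n_bits hard decisions eight per byte, first bit in the
// most significant position (assumed ordering). Unused bits are zero.
std::vector<std::uint8_t> pack_bits(const std::vector<std::int8_t>& llr,
                                    std::size_t n_bits) {
    assert(n_bits <= llr.size());
    std::vector<std::uint8_t> out((n_bits + 7) / 8, 0);
    for (std::size_t byte = 0; byte < out.size(); ++byte) {
        for (int i = 0; i < 8; ++i) {  // the loop that iterates 8 times
            const std::size_t idx = byte * 8 + static_cast<std::size_t>(i);
            const std::uint8_t bit = idx < n_bits ? hard_decision(llr[idx]) : 0;
            out[byte] = static_cast<std::uint8_t>((out[byte] << 1) | bit);
        }
    }
    return out;
}
```

Under these assumptions, a block length of 8448 bits packs into 1056 bytes.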

Since the CUDA code is used as a reference model and is converted to OpenCL for use in the SDAccel environment, the code is simulated. The simulation runs on a Quadro P2000 GPU, and the simulation parameters are chosen so that the decoder processes the largest possible data set. Namely, the code rate is 1/3 and the block length is 8448 with a single segment; therefore the first base graph is used and the lifting factor ZC is 384. The decoder thus has to work with 26112 bits. The maximum number of decoder iterations is 5 and the SNR simulation step is 1 dB. At an SNR of 4 dB, the decoder takes 107.589 µs to produce the final result.

Chapter 4
