
GPU programming basics


Academic year: 2021


(1)

GPU programming basics

Prof. Marco Bertini

(2)

Data parallelism: GPU computing

(3)

CUDA: libraries

(4)

Why use libraries?

Libraries are one of the most efficient ways to program GPUs, because they encapsulate the complexity of writing optimized code for common algorithms into high-level, standard interfaces.

There is a wide variety of high-performance libraries available for NVIDIA GPUs.

(5)

Thrust

Thrust is a C++ template library that is included in the CUDA Toolkit.

Inspired by the STL. Just include the headers: it's a template library, there is nothing to link.

Provides many algorithms: reduce, sort, scan, transformations, search.

Includes an OpenMP backend for multicore programming.

Allows transparent use of the GPU.

• There exist many tools and libraries for GPU programming
• Thrust is now part of the CUDA SDK
• C++ libraries for CUDA programming, inspired by STL
• Many important algorithms: reduce, sort, reduce_by_key, scan, …
• Dramatically reduces overhead of managing heterogeneous memory spaces
• Includes OpenMP backend for multicore programming

(6)

Thrust Hello World

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

// generate 300K random numbers on the host
thrust::host_vector<int> h_vec(300000);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer data to device
thrust::device_vector<int> d_vec = h_vec;

// sort data on device
thrust::sort(d_vec.begin(), d_vec.end());

// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

(7)

Thrust vector

Thrust provides two vector containers, host_vector and device_vector. 


As the names suggest, host_vector is stored in host memory while device_vector lives in GPU device memory.

Thrust’s vector containers are just like std::vector in the C++ STL.

(8)

Thrust iterators

They point to regions of a vector

Can be used like pointers

Can be converted to raw pointers to interface with CUDA


thrust::device_vector<int>::iterator begin = d_vec.begin();
int* d_ptr = thrust::raw_pointer_cast(&begin[0]);
kernel<<<10, 128>>>(d_ptr);

(9)

From CUDA to Thrust

Raw pointers can be used in Thrust. Once the raw pointer has been wrapped by a device_ptr it can be used like an ordinary Thrust iterator. 


int* d_ptr;
cudaMalloc((void**)&d_ptr, N * sizeof(int));
thrust::device_ptr<int> d_vec = thrust::device_pointer_cast(d_ptr);
thrust::sort(d_vec, d_vec + N);
cudaFree(d_ptr);

(10)

Thrust: reduction

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <algorithm>

// generate random data serially
thrust::host_vector<int> h_vec(100);
std::generate(h_vec.begin(), h_vec.end(), rand);

// transfer to device and compute sum
thrust::device_vector<int> d_vec = h_vec;
int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());

(11)

cuBLAS

Many scientific computing applications need high-performance matrix algebra. BLAS is a famous (highly optimized) library for such operations.

The NVIDIA cuBLAS library is a fast GPU-accelerated implementation of the standard Basic Linear Algebra Subroutines (BLAS).

Include <cublas.h>

Link the cuBLAS library files, e.g. in CMake:

cuda_add_cublas_to_target(target_name)

(12)

cuBLAS data layout

Historically BLAS was developed in Fortran, and to maintain compatibility cuBLAS uses column-major storage and 1-based indexing.

For natively written C and C++ code one would most likely choose 0-based indexing, in which case the array index of a matrix element in row "i" and column "j" can be computed via the following macro:

#define IDX2C(i,j,ld) (((j)*(ld))+(i))

where ld refers to the leading dimension of the matrix, which in the case of column-major storage is the number of rows of the allocated matrix.

(13)

cuBLAS data layout

Example:

float* a = (float*)malloc(M * N * sizeof(*a));
for (j = 0; j < N; j++) {
    for (i = 0; i < M; i++) {
        a[IDX2C(i,j,M)] = (float)(i * M + j + 1);
    }
}

(14)

CuBLAS SAXPY

float* d_x;
cudaMalloc((void**)&d_x, N * sizeof(float));
float* d_y;
cudaMalloc((void**)&d_y, N * sizeof(float));

cublasInit(); // init. cuBLAS context

// copy vectors to device memory
cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);

// perform SAXPY on N elements
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);

// copy result back to host
cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);

cublasShutdown();

(15)

Fast math

Adding the -use_fast_math option forces the compiler to use intrinsic math functions for several operations such as division, log, exp and sin/cos/tan.

Works only in device code, of course.

These are single-precision functions that trade accuracy for speed, so they can be less precise than the standard math functions (and than working on doubles).

(16)

C++

Recent versions of CUDA support more and more C++11 features (since CUDA 7.0): auto, lambda expressions, rvalue references, nullptr, defaulted and deleted methods.

C++11 concurrency support (std::thread etc.) is completely missing in device code, though.

Since CUDA 9.0, almost all C++14 features are also supported.

(17)

Books

Programming Massively Parallel Processors: A Hands-on Approach, D. B. Kirk and W-M. W. Hwu, Morgan Kaufmann, 2nd edition, Chapt. 16

or

Programming Massively Parallel Processors: A Hands-on Approach, D. B. Kirk and W-M. W. Hwu, Morgan Kaufmann, 3rd edition, Appendix B
