
(1)

GPU Programming

Introduction and Historical Notes

Franco Milicchio

https://fmilicchio.bitbucket.io

(2)

A Perspective


(6)

A Perspective

Parallel computing is ubiquitous, but it is not new at all

Astonishingly, the Analytical Engine by Charles Babbage (~1837) was a parallel computer

In 1958, S. Gill discussed parallel programming, introducing branching and waiting, while at IBM J. Cocke and D. Slotnick discussed parallelism in numerical calculations

The D825 by Burroughs Corporation, introduced in 1962, was the first modern parallel computer, with four processors

(7)

History of GPUs


(12)

History of GPUs

In the 1970s computers were bulky and graphics was completely neglected

One sector, however, was interested in graphics: games

The industry was driven by arcade games, e.g., Namco, Taito, and many others

The 1980s brought a great innovation with the Amiga PC

Things changed greatly in the 1990s

(13)

S3 ViRGE

My first graphics card (1991)

(14)

The World is 3D


(18)

The World is 3D

The 1990s saw a plethora of graphics cards with different performances (S3, ATI, Matrox, Nvidia, …)

They were fixed-function devices, hence you had to write games specifically for each of them

Then, progressively, fixed functions were replaced by programmable functions

Now we know them by the term “GPUs” (thanks to Nvidia’s marketing team)

(19)

How Speed Changed


(24)

Over the last decades, CPUs relied on clock speed to increase performance

Recently, the growth in clock rate has flattened across all CPU vendors

As it’s now known in industry: the free lunch is over

An increasing GPU market propelled investments in research in many areas

The trend is obviously rising for GPUs, flat for CPUs

How Speed Changed

(25–27)

CPU Clock Rate

[Chart: CPU clock rate over time, log scale; data points read off the chart, in order:]

Year    Clock Rate (MHz)
1971    0.74
1974    2
1976    4
1979    10
1982    16
1986    30
1989    66
1995    200
1998    600
1999    1000
2000    2000
2005    3000
2009    3700

Nothing new here

TeraHertz?

(28)

Cost of Performances

[Chart: cost per GFLOP over time, log scale; data points read off the chart, in order:]

Year    GFLOP Cost (USD)
1961    $145,000,000,000.00
1984    $42,000,000.00
1997    $42,000.00
2000    $1,300.00
2003    $100.00
2007    $52.00
2011    $1.80
2013    $0.16
2017    $0.06

(29)

GFLOPs

[Chart: GFLOPS over time, 1/2003 to 4/2017, GPU vs. CPU; the GPU curve rises steeply while the CPU curve stays nearly flat]

(30)

Who uses GPUs?

TL;DR: everyone, even if you don’t know it

(31)

Speedup


(35)

How fast we can go is highly dependent on how we implement an algorithm

Physical architectures matter a lot for high performance, and without that knowledge you won’t be fast

Let’s see now how a GPU works

Afterwards, we will see how we can use GPUs to achieve much more than just graphics

Speedup

(36)

Vertices, Cells, Textures

This mesh will be rendered into pixels

(37)

GPU Architecture


First, commands, textures, and vertex data are received from the host CPU through shared buffers in system memory or local frame-buffer memory. A command stream is written by the CPU, which initializes and modifies state, sends rendering commands, and references the texture and vertex data. Commands are parsed, and a vertex fetch unit is used to read the vertices referenced by the rendering commands. The commands, vertices, and state changes flow downstream, where they are used by subsequent pipeline stages.

The vertex processors (sometimes called “vertex shaders”), shown in Figure 30-4, allow for a program to be applied to each vertex in the object, performing transformations, skinning, and any other per-vertex operation the user specifies. For the first time, a

Figure 30-3. A Block Diagram of the GeForce 6 Series Architecture

Excerpted from GPU Gems 2

Copyright 2005 by NVIDIA Corporation


(39)

independent memory partitions give the GPU a wide (256 bits), flexible memory subsystem, allowing for streaming of relatively small (32-byte) memory accesses at near the 35 GB/sec physical limit.

30.2.2 Functional Block Diagram for Non-Graphics Operations

As graphics hardware becomes more and more programmable, applications unrelated to the standard polygon pipeline (as described in the preceding section) are starting to

present themselves as candidates for execution on GPUs.

Figure 30-6 shows a simplified view of the GeForce 6 Series architecture, when used as a graphics pipeline. It contains a programmable vertex engine, a programmable fragment engine, a texture load/filter engine, and a depth-compare/blending data write engine.

In this alternative view, a GPU can be seen as a large amount of programmable floating-point horsepower and memory bandwidth that can be exploited for compute-intensive applications completely unrelated to computer graphics.

Figure 30-7 shows another way to view the GeForce 6 Series architecture. When used for non-graphics applications, it can be viewed as two programmable blocks that run serially:

the vertex processor and the fragment processor, both with support for fp32 operands and intermediate values. Both use the texture unit as a random-access data fetch unit and access data at a phenomenal 35 GB/sec (550 MHz DDR memory clock × 256 bits per clock cycle × 2 transfers per clock cycle). In addition, both the vertex and the fragment processor are highly computationally capable. (Performance details follow in Section 30.4.)


Figure 30-6. The GeForce 6 Series Architecture Viewed as a Graphics Pipeline


Excerpted from GPU Gems 2

Copyright 2005 by NVIDIA Corporation

Graphics Pipeline

The vertex processor operates on data, passing it directly to the fragment processor, or by using the rasterizer to expand the data into interpolated values. At this point, each triangle (or point) from the vertex processor has become one or more fragments.

Before a fragment reaches the fragment processor, the z-cull unit compares the pixel’s depth with the values that already exist in the depth buffer. If the pixel’s depth is

greater, the pixel will not be visible, and there is no point shading that fragment, so the fragment processor isn’t even executed. (This optimization happens only if it’s clear that the fragment processor isn’t going to modify the fragment’s depth.) Thinking in a

general-purpose sense, this early culling feature makes it possible to quickly decide to skip work on specific fragments based on a scalar test. Chapter 34 of this book, “GPU Flow-Control Idioms,” explains how to take advantage of this feature to efficiently

predicate work for general-purpose computations.

After the fragment processor runs on a potential pixel (still a “fragment” because it has not yet reached the frame buffer), the fragment must pass a number of tests in order to move farther down the pipeline. (There may also be more than one fragment that

comes out of the fragment processor if multiple render targets [MRTs] are being used.

Up to four MRTs can be used to write out large amounts of data—up to 16 scalar floating-point values at a time, for example—plus depth.)

First, the scissor test rejects the fragment if it lies outside a specified subrectangle of the frame buffer. Although the popular graphics APIs define scissoring at this location in the pipeline, it is more efficient to perform the scissor test in the rasterizer. Scissoring in x and y actually happens in the rasterizer, before fragment processing, and z scissoring happens

Figure 30-7. The GeForce 6 Series Architecture for Non-Graphics Applications


Excerpted from GPU Gems 2

Copyright 2005 by NVIDIA Corporation

(40)

CPU Architecture (Intel i7)


(45)

CPU Architecture (Intel i7)

Cores

Memory Controller

I/O, Queue

Caches

“Multicore” architecture

(46)

What if we got rid of everything?


(57)

CPU Architecture

[Die diagram: the chip contains Cache, ALU, FPU, Memory Controller, Load/Store, Prefetch, Context, I/O, and Bus logic]


(59)

CPU Architecture

[Die diagram: the chip with only ALU and FPU]

Fill this with ALUs and FPUs

(60)

GPU Architecture


(64)

Compared to CPUs, GPUs removed all components that sped up the execution of single instructions

No prefetching, out-of-order execution, branch prediction, and so on; nothing of that, just a minimal context

All of the remaining chip area can now be filled with ALUs, and the chip can therefore run many threads

Since GPUs share control logic among cores, a single instruction is executed by several threads at once: execution units are added instead of control units

GPU Architecture

(70)

SIMD Architecture

[Figure 1-2. The GPU Devotes More Transistors to Data Processing: a CPU die (Control, Cache, a few ALUs, DRAM) next to a GPU die (many ALUs, DRAM)]

More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations – the same program is executed on many data elements in parallel – with high arithmetic intensity – the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.

1.2 CUDA: a General-Purpose Parallel Computing Architecture

In November 2006, NVIDIA introduced CUDA, a general-purpose parallel computing architecture – with a new parallel programming model and instruction set architecture – that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU.

CUDA comes with a software environment that allows developers to use C as a high-level programming language. As illustrated by Figure 1-3, other languages or application programming interfaces are supported, such as CUDA FORTRAN, OpenCL, and DirectCompute.

Excerpted from the NVIDIA CUDA Programming Guide, Version 3.0, Chapter 1 (Introduction)

Many Independent Execution Units

Many Shared Controls

Some Shared Memory

One Huge Shared Memory

“Manycore” architecture
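To make the excerpt’s notion of arithmetic intensity concrete, here is a small worked example of ours (not from the slides): an element-wise sum of two float vectors performs 1 floating-point addition per element while moving 12 bytes (two 4-byte loads and one 4-byte store), i.e. roughly 0.08 operations per byte, so such a kernel is memory-bound. Kernels that reuse each loaded value many times, such as blocked dense matrix multiplication, have a much higher ratio of arithmetic to memory operations and can hide memory latency behind computation, which is exactly the regime the excerpt describes.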

(71)

Try It Yourself

Use Amazon EC2 or similar for free

(72)

Historical Notes


(77)

When shaders were available, people from academia

started using GPUs for something different from graphics

Matrix multiplication was easy and was implemented in 2001, while LU decomposition followed in 2005

Back then one had to use OpenGL and DirectX

Some programming languages emerged, like Brook or Sh, but NVidia changed the development perspective

In 2007 they released CUDA

Historical Notes

(78)

CUDA


(83)

CUDA

CUDA is an acronym for “Compute Unified Device Architecture”

It is a general purpose parallel computing architecture

It comprises hardware, compilers (C, C++, and FORTRAN), and several libraries

It may use several GPUs, if present on the system

Obviously, it runs only on NVidia’s hardware (but compiles on every computer)
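As a practical note of ours (not from the slides): the CUDA toolkit’s compiler driver is nvcc, so a single-file program such as the vector sum shown later is typically built with something like

nvcc vecadd.cu -o vecadd

where vecadd.cu is a hypothetical file name; nvcc separates host code (handed to the system C/C++ compiler) from device code (compiled for the GPU).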


(89)

Main Characteristics

Nvidia suggests using its extensions to standard ISO C

Threads are a native concept

CUDA is a shared memory architecture, and it supports barriers and synchronization calls

A CUDA program is GPU “independent” (sort of)

It runs the same program on different GPUs
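As a minimal illustration of the last two points (the same program running on whatever GPUs are present), the sketch below enumerates the available devices with the standard CUDA runtime calls cudaGetDeviceCount and cudaGetDeviceProperties; it is our own example, not from the slides, and error checking is omitted for brevity.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);                  // number of CUDA-capable GPUs in the system
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);     // fill in the properties of GPU 'dev'
        printf("GPU %d: %s, %d multiprocessors, compute capability %d.%d\n",
               dev, prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    }
    // cudaSetDevice(dev) would select which GPU subsequent kernel launches target
    return 0;
}

The same binary can run on machines with different GPUs: the runtime reports whatever devices it finds, and the code above does not depend on any particular model.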

(90)

Vector Sum

(91)

Vector Sum

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main(void) {
    // ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    // ...
}



(101)

Vector Sum

[Diagram: input vectors A and B, output vector C]

i = threadIdx.x

C[i] = A[i] + B[i]

VecAdd<<<1, N>>>(A, B, C);

1-D block of threads
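The slide shows only the kernel and its launch; to actually run it, the host must allocate device memory, copy the inputs in, and copy the result back. The sketch below is our own completion of the elided “// ...” parts, with N fixed arbitrarily at 256 (a single block, so N must stay within the per-block thread limit, 1024 on current GPUs).

#include <stdio.h>
#include <cuda_runtime.h>

#define N 256

// Kernel definition: one thread per vector element
__global__ void VecAdd(float* A, float* B, float* C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main(void) {
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    // Allocate device memory and copy the inputs to the GPU
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, N * sizeof(float));
    cudaMalloc((void**)&dB, N * sizeof(float));
    cudaMalloc((void**)&dC, N * sizeof(float));
    cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);

    // Kernel invocation with N threads: one 1-D block of N threads
    VecAdd<<<1, N>>>(dA, dB, dC);

    // Copy the result back and check one entry
    cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[10] = %f (expected %f)\n", hC[10], hA[10] + hB[10]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

For vectors larger than a single block, one launches several blocks, e.g. VecAdd<<<numBlocks, threadsPerBlock>>>(...), and computes the index as blockIdx.x * blockDim.x + threadIdx.x.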

(102)

What about branches?

Single control, multiple threads

(103)

Branching without Control

(104)

Branching without Control

// ...
if (i > 3.1415) {
    // ...
}
else {
    // ...
}

(105–109)

Branching without Control

[The same code, annotated over time: a single control unit drives many execution units, so the hardware steps through both branch paths, and a per-thread predicate mask (e.g., 0 0 0 1 0 0 1 1 0 1 0 0 0 1 1 1) marks which threads are active on each path]
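To connect the diagram to code, the sketch below (our own, not from the slides) contrasts two hypothetical kernels operating on a placeholder array data of length n. In the first, even and odd threads of the same 32-thread warp take different paths, so the warp executes both bodies with part of its lanes masked off each time; in the second, the condition is constant within each warp, so no warp diverges. The two kernels do not compute the same result; they only contrast the branching patterns.

__global__ void divergent(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) {                 // neighbouring threads disagree: intra-warp divergence
            data[i] = data[i] * 2.0f;     // the warp runs this body for half of its lanes...
        } else {
            data[i] = data[i] + 1.0f;     // ...and then this body for the other half
        }
    }
}

__global__ void warp_uniform(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((i / 32) % 2 == 0) {          // 32 = warp size: whole warps agree, no divergence
            data[i] = data[i] * 2.0f;
        } else {
            data[i] = data[i] + 1.0f;
        }
    }
}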
