
(1)

An Introduction to Parallel Programming

Moreno Marzolla

Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna

http://www.moreno.marzolla.name/

(2)

Intro to Parallel Programming 2

(3)

Intro to Parallel Programming 3

About this course

Moreno Marzolla

http://www.moreno.marzolla.name/

Research interests: parallel programming; modeling and simulation; software engineering

This course

“Bird's eye view” of parallel programming

Emphasis on practical aspects instead of theoretical foundations

OpenMP and CUDA

Final exam

Short paper (≥ 4 pages) describing a small programming project and/or an HPC algorithm/application of your interest

Please discuss the paper's topic with the instructor before starting to work on it

(4)

Intro to Parallel Programming 4

Before we start

Try to access isi-raptor03.csr.unibo.it

Username: your full email address first.lastname@unibo.it

Password: [your UniBO password]

From a Unix/Linux client:

ssh first.lastname\@unibo.it@isi-raptor03.csr.unibo.it

(5)

Intro to Parallel Programming 5

Moore's Law

"The number of transistors on an IC doubles every 24 months"

That used to mean that every new generation of processors was based on smaller transistors

Moore, G.E., Cramming more components onto integrated circuits. Electronics, 38(8), April 1965

Gordon E. Moore (1929– )

(6)

Intro to Parallel Programming 6

[Figure credit: Wgsimon - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542]

(7)

Intro to Parallel Programming 7

Physics lesson

Smaller transistors → Faster processor

Faster processor → Higher power consumption

Higher power consumption → More heat produced

More heat produced → Unreliable processor

(8)

Intro to Parallel Programming 8

Power

The power required by an IC (e.g., a processor) can be expressed as

Power = C × V² × f, where:

C is the capacitance (ability of a circuit to store energy)

V is the voltage

f is the frequency at which the processor operates

(9)

Intro to Parallel Programming 9

Power

[Figure: a single processor handling its Input/Output at frequency f, versus two processors each running at frequency f/2 on the same workload]

Single processor: Capacitance C, Voltage V, Frequency f → Power = C V² f

Two processors at half frequency: Capacitance 2.2 C, Voltage 0.6 V, Frequency 0.5 f → Power = 2.2 C × (0.6 V)² × 0.5 f ≈ 0.396 C V² f

Credits: Tim Mattson
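As a quick check of the arithmetic above (my addition, not part of the slides), a few lines of C reproduce the comparison:

#include <stdio.h>

int main( void )
{
    /* Normalized power model Power = C * V^2 * f, with C = V = f = 1 */
    double single = 1.0 * 1.0 * 1.0;              /* C * V^2 * f */
    double dual   = 2.2 * (0.6 * 0.6) * 0.5;      /* 2.2 C * (0.6 V)^2 * 0.5 f */
    printf("single processor: %.3f\n", single);   /* prints 1.000 */
    printf("two processors:   %.3f\n", dual);     /* prints 0.396 */
    return 0;
}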

(10)
(11)

Intro to Parallel Programming 11

What happens today?

HW designers create processors with more cores

Result:

parallel hardware is ubiquitous

parallel software is rare

The challenge: make parallel software as common as parallel hardware

[Image: NVidia Tegra 4 SoC]

(12)

Intro to Parallel Programming 12

Applications:

Numerical Wind Tunnel

Source: http://ecomodder.com/forum/showthread.php/random-wind-tunnel-smoke-pictures-thread-26678-12.html

(13)

Applications:

Molecular dynamics

(14)
(15)

Intro to Parallel Programming 15

Applications:

Cosmological Simulation

Bolshoi simulation https://vimeo.com/29769051

The Bolshoi Simulation recreates the large-scale structure of the universe; it required 6 million CPU hours on NASA's Pleiades Supercomputer

Source: https://www.nas.nasa.gov/hecc/resources/pleiades.html

(16)

Intro to Parallel Programming 16

Example

Sum-reduction of an array

("Hello, world!" of parallel programming)

(17)

Intro to Parallel Programming 17

Sum-Reduction

We start assuming a shared-memory architecture

All execution units share a common memory space

We begin with a sequential solution and parallelize it

This is not always a good idea; some parallel algorithms have nothing in common with their sequential counterparts!

However, it is sometimes a reasonable starting point

Credits: Mary Hall

(18)

Intro to Parallel Programming 18

Sequential algorithm

Compute the sum of the content of an array A of length n

float seq_sum(float* A, int n)
{
    int i;
    float sum = 0.0;
    for (i=0; i<n; i++) {
        sum += A[i];
    }
    return sum;
}

Credits: Mary Hall

(19)

Intro to Parallel Programming 19

Version 1 (Wrong!)

Assuming P execution units (e.g., processors), each one computes a partial sum of n / P adjacent elements

Example: n = 15, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    sum += my_x;
}

[Figure: the 15 elements are split into three blocks of 5, assigned to Proc 0, Proc 1 and Proc 2]

WRONG

Variables whose names start with my_ are assumed to be local (private) to each processor; all other variables are assumed to be global (shared)

(20)

Intro to Parallel Programming 20

Version 1 (Wrong!)

Same code as the previous slide. The problem: all processors update the shared variable sum concurrently, with no synchronization, so updates can be lost.

Race condition!
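To make the race condition concrete, here is a minimal POSIX-threads sketch of mine (thread count and array size are arbitrary): all threads update the shared variable sum with no synchronization, so the printed result is typically wrong.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define P 4

float A[N];
float sum = 0.0;        /* shared: updated by all threads with no synchronization */

void *partial_sum( void *arg )
{
    const int my_id = *(int *)arg;
    const int my_start = (long)N * my_id / P;
    const int my_end   = (long)N * (my_id + 1) / P;
    for (int i = my_start; i < my_end; i++) {
        sum += A[i];    /* unprotected read-modify-write: lost updates */
    }
    return NULL;
}

int main( void )
{
    pthread_t t[P];
    int id[P];
    for (int i = 0; i < N; i++) A[i] = 1.0;
    for (int i = 0; i < P; i++) { id[i] = i; pthread_create(&t[i], NULL, partial_sum, &id[i]); }
    for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
    printf("sum = %f (expected %d)\n", sum, N);   /* usually much less than N */
    return 0;
}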

(21)

Intro to Parallel Programming 21

Version 1 (better, but still wrong)

Assuming P processors, each one computes a partial sum of n / P adjacent elements

Example: n = 15, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}

WRONG

[Figure: the array is again split into three equal blocks, assigned to Proc 0, Proc 1 and Proc 2]

(22)

Intro to Parallel Programming 22

Version 1 (better, but still wrong)

Same code as the previous slide, but now with n = 17 and P = 3: since my_block_len = n/P = 5, the three blocks cover only 15 elements and the last two elements of the array are assigned to no processor.

[Figure: Proc 0, Proc 1 and Proc 2 each get 5 elements; the last 2 elements are marked "??"]

WRONG

(23)

Intro to Parallel Programming 23

Version 1

(correct, but not efficient)

Assuming P processors, each one computes a partial sum of n / P adjacent elements

Example: n = 17, P = 3

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}

[Figure: with n = 17 and P = 3 the blocks have 5, 6 and 6 elements; every element is assigned to some processor]
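A small check of mine (not part of the slides) showing that the corrected index computation covers all elements when n = 17 and P = 3:

#include <stdio.h>

int main( void )
{
    const int n = 17, P = 3;
    for (int my_id = 0; my_id < P; my_id++) {
        const int my_start = n * my_id / P;       /* corrected formula */
        const int my_end   = n * (my_id + 1) / P;
        printf("Proc %d: [%d, %d) -> %d elements\n",
               my_id, my_start, my_end, my_end - my_start);
    }
    /* Prints [0,5), [5,11), [11,17): blocks of 5, 6 and 6 elements, nothing left out */
    return 0;
}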

(24)

Intro to Parallel Programming 24

Version 2

Too much contention on mutex m

Each processor acquires and releases the mutex for each element of the array!

Solution: use the mutex at a coarser granularity

Each processor accumulates the partial sum on a local (private) variable

The mutex is used at the end to update the global sum

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; my_sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    my_sum += my_x;
}
mutex_lock(&m);
sum += my_sum;
mutex_unlock(&m);
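A possible concrete rendering of Version 2 with POSIX threads; this is my own sketch under the same assumptions as the pseudocode (a global array A, P threads, one global mutex):

#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define P 4

float A[N];
float sum = 0.0;                                  /* shared global sum */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *partial_sum( void *arg )
{
    const int my_id = *(int *)arg;
    const int my_start = (long)N * my_id / P;
    const int my_end   = (long)N * (my_id + 1) / P;
    float my_sum = 0.0;                           /* private accumulator */
    for (int i = my_start; i < my_end; i++) {
        my_sum += A[i];
    }
    pthread_mutex_lock(&m);                       /* one lock/unlock per thread, not per element */
    sum += my_sum;
    pthread_mutex_unlock(&m);
    return NULL;
}

int main( void )
{
    pthread_t t[P];
    int id[P];
    for (int i = 0; i < N; i++) A[i] = 1.0;
    for (int i = 0; i < P; i++) { id[i] = i; pthread_create(&t[i], NULL, partial_sum, &id[i]); }
    for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
    printf("sum = %f\n", sum);
    return 0;
}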

(25)

Intro to Parallel Programming 25

Version 3: Remove the mutex

(wrong, in a subtle way)

We use a shared array psum[] where each processor can store its local sum

At the end, one processor computes the global sum

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0; /* all elements set to 0.0 */
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
if ( 0 == my_id ) { /* only the master executes this */
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}

WRONG

(26)

Intro to Parallel Programming 26

The problem with version 3

Processor 0 could start the computation of the global sum before all other processors have computed the local sums!

[Figure: timeline of P0..P3; P0 may start the "compute global sum" phase while the other processors are still computing their local sums]

(27)

Intro to Parallel Programming 27

Version 4 (correct)

Use a barrier synchronization

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
barrier();
if ( 0 == my_id ) {
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}

[Figure: timeline of P0..P3; all processors compute their local sums, wait at barrier(), and only then P0 computes the global sum]
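Since the course uses OpenMP, here is one possible rendering of Version 4 with OpenMP threads; this is a sketch of mine, not the instructor's code, and in real OpenMP programs one would normally just use a reduction clause:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

float parallel_sum( const float *A, int n )
{
    const int P = omp_get_max_threads();
    float *psum = calloc(P, sizeof(float));       /* shared array of partial sums */
    float sum = 0.0;
#pragma omp parallel num_threads(P)
    {
        const int my_id = omp_get_thread_num();
        const int my_start = (long)n * my_id / P;
        const int my_end   = (long)n * (my_id + 1) / P;
        for (int i = my_start; i < my_end; i++) {
            psum[my_id] += A[i];
        }
#pragma omp barrier                               /* wait until every local sum is ready */
#pragma omp master
        {
            for (int i = 0; i < P; i++) sum += psum[i];
        }
    }
    free(psum);
    return sum;
}

int main( void )
{
    enum { N = 1000000 };
    static float A[N];
    for (int i = 0; i < N; i++) A[i] = 1.0;
    printf("sum = %f\n", parallel_sum(A, N));     /* compile with -fopenmp */
    return 0;
}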

(28)

Intro to Parallel Programming 28

Version 5

Distributed-memory version

P << n processors

Each processor computes a local sum

Each processor sends its local sum to processor 0 (the master)

...
my_sum = 0.0;
my_start = …; my_end = …;
for ( i = my_start; i < my_end; i++ ) {
    my_sum += get_value(i);
}
if ( 0 == my_id ) {
    for ( i=1; i<P; i++ ) {
        tmp = receive from proc i;
        my_sum += tmp;
    }
    printf("The sum is %f\n", my_sum);
} else {
    send my_sum to proc 0;
}
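The send/receive pseudocode above maps naturally onto MPI point-to-point operations; the following is a minimal sketch of mine (the local data is a constant stand-in, and the names are my own):

/* Compile with mpicc, run with e.g. mpirun -n 4 ./sum */
#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
    const int LOCAL_N = 1000;                     /* elements owned by each process (arbitrary) */
    int my_id, P;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    float my_sum = 0.0;
    for (int i = 0; i < LOCAL_N; i++) {
        my_sum += 1.0;                            /* stand-in for get_value(i) */
    }

    if ( 0 == my_id ) {
        for (int i = 1; i < P; i++) {
            float tmp;
            MPI_Recv(&tmp, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            my_sum += tmp;
        }
        printf("The sum is %f\n", my_sum);
    } else {
        MPI_Send(&my_sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}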

(29)

Intro to Parallel Programming 29

Version 5

[Figure: A[ ] = {1, 3, -2, 7, -6, 5, 3, 4}, one element per processor (Proc 0 … Proc 7); Proc 0 receives the other processors' values one at a time and accumulates them into my_sum, which passes through 4, 2, 9, 3, 8, 11 and finally 15]

(30)

Intro to Parallel Programming 30

Version 5

Same picture as the previous slide: Proc 0 receives and adds the P − 1 partial sums one after another.

Bottleneck

(31)

Intro to Parallel Programming 31

Parallel reduction

[Figure: tree reduction of A[ ] = {1, 3, -2, 7, -6, 5, 3, 4} across Proc 0 … Proc 7: the first level produces the pairwise sums 4, 5, -1, 7; the second level produces 9 and 6; Proc 0 combines them into the final result 15]

(P – 1) sums are still performed; however, processor 0 receives ~ log2 P messages and performs ~ log2 P sums
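In practice this tree-structured combination is what collective operations such as MPI_Reduce typically implement; a sketch of mine, under the same assumptions as the previous MPI sketch:

#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
    int my_id;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    float my_sum = 0.0;
    for (int i = 0; i < 1000; i++) {
        my_sum += 1.0;                            /* local partial sum (stand-in data) */
    }

    float sum = 0.0;
    /* The library is free to combine the partial sums along a tree, so the
       root takes part in about log2(P) combine steps instead of P-1 receives */
    MPI_Reduce(&my_sum, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if ( 0 == my_id ) printf("The sum is %f\n", sum);
    MPI_Finalize();
    return 0;
}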

(32)

Intro to Parallel Programming 32

Parallel architectures

(33)

Intro to Parallel Programming 33

Flynn's Taxonomy

SISD

Single Instruction Stream Single Data Stream

SIMD

Single Instruction Stream Multiple Data Streams

MISD

Multiple Instruction Streams Single Data Stream

MIMD

Multiple Instruction Streams Multiple Data Streams

[Table: the four classes arranged by Instruction Streams (Single/Multiple) × Data Streams (Single/Multiple); SISD corresponds to the classical Von Neumann architecture]

(34)

Intro to Parallel Programming 34

Single Instruction Multiple Data

SIMD instructions apply the same operation (e.g., sum, product…) to multiple elements (typically 4 or 8, depending on the width of SIMD registers and the data types of operands)

This means that there must be 4/8/... independent ALUs

[Figure: four lanes executing LOAD A[i]; LOAD B[i]; C[i] = A[i] + B[i]; STORE C[i] for i = 0..3 at the same time steps]
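For concreteness, the four-lane addition in the figure is what a single SSE instruction does on x86; a sketch of mine using SSE intrinsics (requires an x86 CPU with SSE, which is not something the slides assume):

#include <stdio.h>
#include <xmmintrin.h>                            /* SSE intrinsics */

int main( void )
{
    float A[4] = {1, 2, 3, 4}, B[4] = {10, 20, 30, 40}, C[4];
    __m128 va = _mm_loadu_ps(A);                  /* LOAD A[0..3] into one 128-bit register */
    __m128 vb = _mm_loadu_ps(B);                  /* LOAD B[0..3] */
    __m128 vc = _mm_add_ps(va, vb);               /* C[i] = A[i] + B[i], i = 0..3, one instruction */
    _mm_storeu_ps(C, vc);                         /* STORE C[0..3] */
    for (int i = 0; i < 4; i++) printf("%f\n", C[i]);
    return 0;
}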

(35)

Intro to Parallel Programming 36

GPU

Modern GPUs (Graphics Processing Units) can be regarded as a special kind of SIMD architecture

[Figure: Fermi GPU die (source: http://www.legitreviews.com/article/1100/1/)]

(36)

Intro to Parallel Programming 37

MIMD

In MIMD systems there are multiple execution units that can execute multiple sequences of instructions

Multiple Instruction Streams

Each execution unit might operate on a different input

Multiple Data Streams

[Figure: four execution units, each executing a different instruction stream on different data, e.g. "CALL F(); z = 8; y = 1.7; z = x + y", "a = 18; b = 9; if (a>b) c = 7; a = a - 1", "w = 7; t = 13; k = G(w,t); k = k + 1", "LOAD A[0]; LOAD B[0]; C[0] = A[0] + B[0]; STORE C[0]"]

(37)

Intro to Parallel Programming 38

MIMD architectures

Shared Memory

A set of processors sharing a common memory space

Each processor can access any memory location

Distributed Memory

A set of compute nodes connected through an interconnection network

The simplest example: a cluster of PCs connected via Ethernet

Nodes must communicate explicitly to share data

[Figure: shared memory: several CPUs connected to a single Memory through an Interconnect; distributed memory: several nodes, each with its own CPU and Mem, connected through an Interconnect]

(38)
(39)

Intro to Parallel Programming 40

Hybrid architectures

Many HPC systems are based on hybrid architectures

Each compute node is a shared-memory multiprocessor

Many compute nodes are connected through a network

[Figure: four compute nodes, each with several CPUs, local Mem and GPUs, connected through an Interconnect]

(40)

Intro to Parallel Programming 41

Recap

Shared memory

Advantages:

Easy to program

Useful for applications with irregular data access patterns (e.g., graph algorithms)

Disadvantages:

The programmer must take care of race conditions

Limited memory bandwidth

Distributed memory

Advantages:

Scalable: provides high computational power by adding more nodes

Useful for applications with strong locality of reference and a high computation / communication ratio

Disadvantages:

Latency of interconnect network

Difficult to program

(41)

Intro to Parallel Programming 42

Efficiency and scalability of parallel programs

(42)

Intro to Parallel Programming 43

Scalability

How much faster can a given problem be solved with p workers instead of one?

How much more work can be done with p workers instead of one?

What impact do the communication requirements of the parallel application have on performance?

What fraction of the resources is actually used productively for solving the problem?

Credits: David Padua

(43)

Intro to Parallel Programming 44

Speedup

Let us define:

p = Number of processors / cores

Tserial = Execution time of the serial program

Tparallel (p) = Execution time of the parallel program with p

processors / cores

(44)

Intro to Parallel Programming 45

Speedup

Speedup S(p)

In the ideal case, the parallel program requires 1/p the time of the sequential program

S(p) = p is the ideal case of linear speedup

Realistically, S(p) ≤ p

Is it possible to observe S(p) > p ?

S(p) = Tserial / Tparallel(p)

Usually Tserial = Tparallel(1)

(45)

Intro to Parallel Programming 46

Non-parallelizable portions

Suppose that a fraction α of the total execution time of the serial program can not be parallelized

E.g., due to:

Algorithmic limitations (data dependencies)

Bottlenecks (e.g., shared resources)

Startup overhead

Communication costs

Suppose that the remaining fraction (1 - α) can be fully parallelized

Then, we have:

Tparallel(p) = α Tserial + (1 − α) Tserial / p

(46)

Intro to Parallel Programming 47

Example

Intrinsically serial part vs. parallelizable part

[Figure: bars for p = 1, 2, 4; the serial part α Tserial stays the same, while the parallelizable part (1 − α) Tserial shrinks to (1 − α) Tserial / p, giving Tparallel(2) and Tparallel(4)]

Tparallel(p) = α Tserial + (1 − α) Tserial / p

(47)

Intro to Parallel Programming 48

Example

Suppose that a program has Tserial = 20s

Assume that 10% of the time is spent in a serial portion of the program

Therefore, the execution time of a parallel version with p processors is

Tparallel(p) = 0.1 Tserial + 0.9 Tserial / p = 2 + 18/p

(48)

Intro to Parallel Programming 49

Example (cont.)

The speedup is

S(p) = Tserial / (0.1 × Tserial + 0.9 × Tserial / p) = 20 / (2 + 18/p)

What is the maximum speedup that can be achieved when p → +∞ ? (As p → +∞, S(p) → 20/2 = 10)

(49)

Intro to Parallel Programming 50

Amdahl's Law

What is the maximum speedup?

S(p) = Tserial / Tparallel(p) = Tserial / (α Tserial + (1 − α) Tserial / p) = 1 / (α + (1 − α)/p)

Gene Myron Amdahl (1922– )

(50)

Intro to Parallel Programming 51

Amdahl's Law

From

S(p) = 1 / (α + (1 − α)/p)

we get an asymptotic speedup of 1/α when p grows to infinity

If a fraction α of the total execution time is spent on a serial portion of the program, then the maximum achievable speedup is 1/α
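To see how quickly the speedup saturates, here is a short C helper of mine that tabulates Amdahl's formula for α = 0.1, the value used in the earlier example:

#include <stdio.h>

/* Amdahl's law: speedup with serial fraction alpha on p processors */
double amdahl( double alpha, int p )
{
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

int main( void )
{
    const double alpha = 0.1;                     /* 10% serial, as in the earlier example */
    for (int p = 1; p <= 1024; p *= 2) {
        printf("p = %4d   S(p) = %.2f\n", p, amdahl(alpha, p));
    }
    printf("asymptotic limit 1/alpha = %.2f\n", 1.0 / alpha);
    return 0;
}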

(51)
(52)

Intro to Parallel Programming 53

Scaling Efficiency

Strong Scaling: increase the number of processors p keeping the total problem size fixed

The total amount of work remains constant

The amount of work for a single processor decreases as p increases

Goal: reduce the total execution time by adding more processors

Weak Scaling: increase the number of processors p keeping the per-processor work fixed

The total amount of work grows as p increases

Goal: solve larger problems within the same amount of time

(53)

Intro to Parallel Programming 54

Strong Scaling Efficiency

E(p) = Strong Scaling Efficiency

where

Tserial = Execution time of the serial program

Tparallel (p) = Execution time of the parallel program with p

processors / cores

E(p) = S(p) / p = Tserial / (p × Tparallel(p))
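As an illustration (not from the slides), speedup and strong scaling efficiency can be computed directly from measured wall-clock times; the times below are hypothetical:

#include <stdio.h>

int main( void )
{
    /* Hypothetical measured wall-clock times (seconds) for p = 1, 2, 4, 8 */
    const int    p[]    = {1, 2, 4, 8};
    const double Tpar[] = {20.0, 11.0, 6.5, 4.2};
    const double Tserial = Tpar[0];               /* assuming Tserial = Tparallel(1) */

    for (int i = 0; i < 4; i++) {
        const double S = Tserial / Tpar[i];       /* speedup */
        const double E = S / p[i];                /* strong scaling efficiency */
        printf("p = %d   S(p) = %.2f   E(p) = %.2f\n", p[i], S, E);
    }
    return 0;
}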

(54)

55

Strong Scaling Efficiency and Amdahl's Law

(55)

Intro to Parallel Programming 56

Weak Scaling Efficiency

W(p) = Weak Scaling Efficiency

where

T1 = time required to complete 1 work unit with 1 processor

Tp = time required to complete p work units with p processors

W(p) = T1 / Tp

(56)

Intro to Parallel Programming 57

Example

See omp-matmul.c

demo-strong-scaling.sh computes the times required for strong scaling (demo-strong-scaling.ods)

demo-weak-scaling.sh computes the times required for weak scaling (demo-weak-scaling.ods)

Important note

For a given n, the amount of "work" required to compute the product of two n × n matrices is ~ n³

To double the total work, do not double n!

The amount of work required to compute the product of two (2n × 2n) matrices is ~ 8n³, eight times the n × n case

To double the work, we must use matrices of size (∛2 n) × (∛2 n)

In general, with p processes we use matrices of size (∛p n) × (∛p n)
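A small helper of mine that follows this reasoning and computes the matrix size to use with p processes, so that the total work grows by a factor of p; the base size n is arbitrary:

#include <stdio.h>
#include <math.h>

int main( void )
{
    const int n = 1000;                           /* base matrix size (arbitrary) */
    for (int p = 1; p <= 8; p *= 2) {
        const int np = (int)round(n * cbrt((double)p));   /* work ~ np^3 = p * n^3 */
        printf("p = %d: use a %d x %d matrix\n", p, np, np);
    }
    return 0;                                     /* compile with -lm */
}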
