An Introduction to Parallel Programming
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna
http://www.moreno.marzolla.name/
About this course
● Moreno Marzolla
– http://www.moreno.marzolla.name/
– Research interests: parallel programming; modeling and simulation; software engineering
● This course
– “Bird's eye view” of parallel programming
– Emphasis on practical aspects instead of theoretical foundations
– OpenMP and CUDA
● Final exam
– Short paper (≥ 4 pages) describing a small programming project and/or an HPC algorithm/application of your interest
– Please discuss the paper's topic with the instructor before starting to work on it
Before we start
● Try to access isi-raptor03.csr.unibo.it
– Username: your full email address first.lastname@unibo.it
– Password: [your UniBO password]
● From a Unix/Linux client:
ssh first.lastname\@unibo.it@isi-raptor03.csr.unibo.it
Moore's Law
"The number of transistors on an IC doubles every 24 months"
● That used to mean that every new generation of processors was based on smaller transistors
Moore, G.E., "Cramming more components onto integrated circuits". Electronics, 38(8), April 1965
Gordon E. Moore (1929–)
[Figure: microprocessor transistor counts over time. By Wgsimon, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542]
Physics lesson
● Smaller transistors → Faster processor
● Faster processor → Higher power consumption
● Higher power consumption → More heat produced
● More heat produced → Unreliable processor
Power
● The power required by an IC (e.g., a processor) can be expressed as
Power = C × V² × f, where:
– C is the capacitance (ability of a circuit to store energy)
– V is the voltage
– f is the frequency at which the processor operates
Power
[Figure: a single processor running at frequency f vs. two processors each running at frequency f/2]
Single processor:       capacitance C,     voltage V,     frequency f      →  Power = C V² f
Two processors at f/2:  capacitance 2.2 C, voltage 0.6 V, frequency 0.5 f  →  Power = 0.396 C V² f
Credits: Tim Mattson
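Working the numbers through the power formula confirms the value in the figure (arithmetic added here, not on the original slide):

\[ \mathrm{Power}_{\mathrm{dual}} = 2.2\,C \times (0.6\,V)^2 \times 0.5\,f = (2.2 \times 0.36 \times 0.5)\, C V^2 f = 0.396\, C V^2 f \]

i.e., roughly 40% of the original power for, in principle, the same aggregate instruction throughput.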
What happens today?
● HW designers create processors with more cores
● Result:
– parallel hardware is ubiquitous
– parallel software is rare
● The challenge
– Make parallel software as common as parallel hardware
NVidia Tegra 4 SoC
Applications:
Numerical Wind Tunnel
Source: http://ecomodder.com/forum/showthread.php/random-wind-tunnel-smoke-pictures-thread-26678-12.html
Applications:
Molecular dynamics
Applications:
Cosmological Simulation
Bolshoi simulation https://vimeo.com/29769051
The Bolshoi Simulation recreates the large-scale structure of the universe;
it required 6 million CPU hours on NASA's Pleiades Supercomputer
Source: https://www.nas.nasa.gov/hecc/resources/pleiades.html
Example
Sum-reduction of an array
("Hello, world!" of parallel programming)
Sum-Reduction
● We start by assuming a shared-memory architecture
– All execution units share a common memory space
● We begin with a sequential solution and parallelize it
– This is not always a good idea; some parallel algorithms have nothing in common with their sequential counterparts!
– However, it is sometimes a reasonable starting point
Credits: Mary Hall
Sequential algorithm
● Compute the sum of the content of an array A of length n
float seq_sum(float* A, int n)
{
    int i;
    float sum = 0.0;
    for (i=0; i<n; i++) {
        sum += A[i];
    }
    return sum;
}
Credits: Mary Hall
Version 1 (Wrong!)
● Assuming P execution units (e.g., processors), each one computes a partial sum of n / P adjacent elements
● Example: n = 15, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    sum += my_x;
}
[Figure: the 15-element array is partitioned into three blocks of 5 adjacent elements, one per processor]
WRONG — race condition: all processors update the shared variable sum concurrently, so updates can be lost.
Variables whose names start with my_ are assumed to be local (private) to each processor; all other variables are assumed to be global (shared).
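A minimal sketch (not part of the original slides) of the race condition using POSIX threads; the array contents, N, P and the helper names are illustrative assumptions. All threads update the shared variable sum without synchronization, so updates are frequently lost and the printed result is usually smaller than expected.

#include <stdio.h>
#include <pthread.h>

#define N 1000000
#define P 4

float A[N];        /* shared input array */
float sum = 0.0f;  /* shared accumulator: the source of the race */

void* worker(void* arg)
{
    const int my_id = (int)(long)arg;
    const int my_block_len = N / P;
    const int my_start = my_id * my_block_len;
    const int my_end = my_start + my_block_len;
    for (int my_i = my_start; my_i < my_end; my_i++) {
        sum += A[my_i];   /* unsynchronized read-modify-write */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[P];
    for (int i = 0; i < N; i++) A[i] = 1.0f;
    for (int i = 0; i < P; i++) pthread_create(&tid[i], NULL, worker, (void*)(long)i);
    for (int i = 0; i < P; i++) pthread_join(tid[i], NULL);
    printf("sum = %f (expected %f)\n", sum, (float)N);
    return 0;
}

Compiled with gcc -pthread, most runs print a value smaller than 1000000 because of the lost updates.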
Version 1 (better, but still wrong)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 15, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}
WRONG
Version 1 (better, but still wrong)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 17, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}
WRONG — with n = 17 and P = 3, n/P = 5, so each processor handles only 5 elements and the last 2 elements of the array are not assigned to any processor.
Version 1 (correct, but not efficient)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 17, P = 3
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}
[Figure: with n = 17 and P = 3 the blocks are [0, 5), [5, 11), [11, 17) — every element is assigned to exactly one processor]
Version 2
● Too much contention on mutex m
– Each processor acquires and releases the mutex for each element of the array!
● Solution: increase the mutex granularity
– Each processor accumulates the partial sum on a local (private) variable
– The mutex is used at the end to update the global sum
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; my_sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    my_sum += my_x;
}
mutex_lock(&m);
sum += my_sum;
mutex_unlock(&m);
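For reference, OpenMP (covered later in the course) expresses this same pattern with a reduction clause; the sketch below is an assumption about how one might write it, not taken from the slides. Each thread accumulates into a private copy of sum, and the private copies are combined once at the end of the loop — the same idea as Version 2.

#include <omp.h>

float omp_sum(const float* A, int n)
{
    float sum = 0.0f;
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += A[i];   /* each thread updates its own private copy of sum */
    }
    return sum;        /* the private copies have been combined here */
}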
Version 3: Remove the mutex (wrong, in a subtle way)
● We use a shared array psum[] where each processor can store its local sum
● At the end, one processor computes the global sum
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0; /* all elements set to 0.0 */
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
if ( 0 == my_id ) { /* only the master executes this */
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}
WRONG
The problem with version 3
● Processor 0 could start the computation of the global sum before all other processors have computed the local sums!
[Figure: timeline with four processors P0–P3 — P0 starts computing the global sum while P1–P3 are still computing their local sums]
Version 4 (correct)
● Use a barrier synchronization
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
barrier();
if ( 0 == my_id ) {
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}
[Figure: timeline with four processors P0–P3 — all processors compute their local sums, synchronize at barrier(), then P0 computes the global sum]
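A possible concrete rendering of Version 4 with OpenMP (a sketch under assumptions: the slides use pseudocode and do not prescribe this API; the array contents, N and P are placeholders). The #pragma omp barrier plays the role of barrier().

#include <stdio.h>
#include <omp.h>

#define N 1000000
#define P 4

float A[N], psum[P];

int main(void)
{
    float sum = 0.0f;
    for (int i = 0; i < N; i++) A[i] = 1.0f;

#pragma omp parallel num_threads(P)
    {
        const int my_id = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();   /* may be fewer than P */
        const int my_start = N * my_id / nthreads;
        const int my_end = N * (my_id + 1) / nthreads;
        psum[my_id] = 0.0f;
        for (int my_i = my_start; my_i < my_end; my_i++)
            psum[my_id] += A[my_i];
#pragma omp barrier                      /* wait until every psum[] is ready */
        if (0 == my_id) {                /* only the master computes the global sum */
            for (int i = 0; i < nthreads; i++)
                sum += psum[i];
        }
    }
    printf("sum = %f (expected %f)\n", sum, (float)N);
    return 0;
}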
Version 5: Distributed-memory version
● P << n processors
● Each processor computes a local sum
● Each processor sends its local sum to processor 0 (the master)
my_sum = 0.0;
my_start = …; my_end = …;
for ( i = my_start; i < my_end; i++ ) {
    my_sum += get_value(i);
}
if ( 0 == my_id ) {
    for ( i=1; i<P; i++ ) {
        tmp = receive from proc i;
        my_sum += tmp;
    }
    printf("The sum is %f\n", my_sum);
} else {
    send my_sum to proc 0;
}
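The pseudocode above does not name a concrete message-passing API. A minimal MPI rendering (an assumption — MPI is not introduced on these slides) could look as follows; each process is supposed to have already computed my_sum over its own portion of the data (omitted here).

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int my_id, P;
    float my_sum = 0.0f, tmp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* ... each process computes my_sum over its own block of data ... */

    if (0 == my_id) {
        for (int i = 1; i < P; i++) {
            MPI_Recv(&tmp, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            my_sum += tmp;
        }
        printf("The sum is %f\n", my_sum);
    } else {
        MPI_Send(&my_sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}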
Version 5
[Figure: eight processors; each holds a local sum of its portion of A[] (1, 3, -2, 7, -6, 5, 3, 4). Processor 0 receives the other local sums one at a time and accumulates them: 4, 2, 9, 3, 8, 11, 15.]
Processor 0 is a bottleneck: it receives and adds all P − 1 partial sums sequentially.
Parallel reduction
[Figure: tree-structured reduction across eight processors — the local sums 1, 3, -2, 7, -6, 5, 3, 4 are added pairwise (4, 5, -1, 7), then again (9, 6), and processor 0 obtains the final result 15.]
● (P – 1) sums are still performed; however, processor 0 receives ~ log₂ P messages and performs ~ log₂ P sums
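In practice the whole combination step can be delegated to a library call; MPI_Reduce, for instance, typically uses a tree-structured algorithm like the one sketched above (again an assumption, since the slides stay at the pseudocode level). It would be a drop-in replacement for the if/else block in the earlier MPI sketch:

/* Each process passes its local sum; only the root (process 0) receives the total. */
float sum = 0.0f;
MPI_Reduce(&my_sum, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
if (0 == my_id) {
    printf("The sum is %f\n", sum);
}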
Parallel architectures
Flynn's Taxonomy
● SISD — Single Instruction Stream, Single Data Stream (the Von Neumann architecture)
● SIMD — Single Instruction Stream, Multiple Data Streams
● MISD — Multiple Instruction Streams, Single Data Stream
● MIMD — Multiple Instruction Streams, Multiple Data Streams
Single Instruction Multiple Data
● SIMD instructions apply the same operation (e.g., sum, product, …) to multiple elements (typically 4 or 8, depending on the width of the SIMD registers and the data types of the operands)
– This means that there must be 4/8/... independent ALUs
[Figure: four SIMD lanes execute LOAD A[i], LOAD B[i], C[i] = A[i] + B[i], STORE C[i] for i = 0..3 simultaneously, one instruction per time step.]
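As a concrete illustration (not from the slides, and assuming an x86 processor with SSE support), the four simultaneous additions in the figure can be written with compiler intrinsics; each 128-bit register holds four floats.

#include <immintrin.h>

/* C[0..3] = A[0..3] + B[0..3] with a single SIMD addition */
void add4(const float* A, const float* B, float* C)
{
    __m128 a = _mm_loadu_ps(A);    /* LOAD A[0..3] */
    __m128 b = _mm_loadu_ps(B);    /* LOAD B[0..3] */
    __m128 c = _mm_add_ps(a, b);   /* four sums computed by one instruction */
    _mm_storeu_ps(C, c);           /* STORE C[0..3] */
}

In many cases the compiler can generate equivalent code automatically (auto-vectorization) from the plain scalar loop.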
GPU
Fermi GPU die (source: http://www.legitreviews.com/article/1100/1/)
● Modern GPUs (Graphics Processing Units) can be regarded as a special kind of SIMD architecture
MIMD
● In MIMD systems there are multiple execution units that can execute multiple sequences of instructions
– Multiple Instruction Streams
● Each execution unit might operate on a different input
– Multiple Data Streams
[Figure: each execution unit runs its own instruction stream on its own data — e.g., one unit executes z = x + y; CALL F(); z = 8; y = 1.7, another a = 18; b = 9; if (a > b) c = 7; a = a - 1, another w = 7; t = 13; k = G(w,t); k = k + 1, and another the LOAD/ADD/STORE sequence of the SIMD example.]
MIMD architectures
● Shared Memory
– A set of processors sharing a common memory space
– Each processor can access any memory location
● Distributed Memory
– A set of compute nodes connected through an interconnection network
● The simplest example: a cluster of PCs connected via Ethernet
– Nodes must communicate explicitly to share data
[Figure: shared memory — several CPUs connected to a single memory through an interconnect; distributed memory — several nodes, each with a CPU and its own local memory, connected through an interconnect.]
Hybrid architectures
● Many HPC systems are based on hybrid architectures
– Each compute node is a shared-memory multiprocessor
– Many compute nodes are connected through a network
[Figure: several compute nodes, each containing multiple CPUs with shared memory and GPUs, connected through an interconnect.]
Recap
Shared memory
● Advantages:
– Easy to program
– Useful for applications with irregular data access patterns (e.g., graph algorithms)
● Disadvantages:
– The programmer must take care of race conditions
– Limited memory bandwidth
Distributed memory
● Advantages:
– Scalable: provides high computational power by adding more nodes
– Useful for applications with strong locality of reference and a high computation / communication ratio
● Disadvantages:
– Latency of interconnect network
– Difficult to program
Efficiency and scalability of parallel programs
Scalability
● How much faster can a given problem be solved with p workers instead of one?
● How much more work can be done with p workers instead of one?
● What impact do the communication requirements of the parallel application have on performance?
● What fraction of the resources is actually used productively for solving the problem?
Credits: David Padua
Speedup
● Let us define:
– p = Number of processors / cores
– Tserial = Execution time of the serial program
– Tparallel(p) = Execution time of the parallel program with p processors / cores
Speedup
● Speedup S(p)
● In the ideal case, the parallel program requires 1/p the time of the sequential program
● S(p) = p is the ideal case of linear speedup
– Realistically, S(p) ≤ p
– Is it possible to observe S(p) > p ?
S(p) = Tserial / Tparallel(p)
Usually Tserial = Tparallel(1)
Non-parallelizable portions
● Suppose that a fraction α of the total execution time of the serial program can not be parallelized
– E.g., due to:
● Algorithmic limitations (data dependencies)
● Bottlenecks (e.g., shared resources)
● Startup overhead
● Communication costs
● Suppose that the remaining fraction (1 - α) can be fully parallelized
● Then, we have:
Tparallel(p) = α Tserial + (1 − α) Tserial / p
Example
[Figure: execution-time bars for p = 1, 2, 4. The intrinsically serial part (α Tserial) stays constant, while the parallelizable part ((1 − α) Tserial) shrinks to (1 − α) Tserial / p, so Tparallel(p) = α Tserial + (1 − α) Tserial / p.]
Example
● Suppose that a program has Tserial = 20s
● Assume that 10% of the time is spent in a serial portion of the program
● Therefore, the execution time of a parallel version with p processors is
Tparallel(p) = 0.1 Tserial + 0.9 Tserial / p = 2 + 18 / p
Example (cont.)
● The speedup is
● What is the maximum speedup that can be achieved when p → +∞ ?
S(p) = Tserial / (0.1 Tserial + 0.9 Tserial / p) = 20 / (2 + 18 / p)
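For instance (numbers worked out here, not on the slide):

\[ S(2) = \frac{20}{2 + 9} \approx 1.82, \qquad S(4) = \frac{20}{2 + 4.5} \approx 3.08, \qquad \lim_{p \to \infty} S(p) = \frac{20}{2} = 10 \]

so no matter how many processors are used, the speedup cannot exceed 10.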
Amdahl's Law
● What is the maximum speedup?
S(p) = Tserial / Tparallel(p) = Tserial / (α Tserial + (1 − α) Tserial / p) = 1 / (α + (1 − α) / p)
Gene Myron Amdahl (1922–)
Amdahl's Law
● From S(p) = 1 / (α + (1 − α) / p) we get an asymptotic speedup of 1/α as p grows to infinity
If a fraction α of the total execution time is spent on a serial portion of the program, then the maximum achievable speedup is 1/α
Scaling Efficiency
● Strong Scaling: increase the number of processors p keeping the total problem size fixed
– The total amount of work remains constant
– The amount of work for a single processor decreases as p increases
– Goal: reduce the total execution time by adding more processors
● Weak Scaling: increase the number of processors p keeping the per-processor work fixed
– The total amount of work grows as p increases
– Goal: solve larger problems within the same amount of time
Strong Scaling Efficiency
● E(p) = Strong Scaling Efficiency
● where
– Tserial = Execution time of the serial program
– Tparallel(p) = Execution time of the parallel program with p processors / cores
E(p) = S(p) / p = Tserial / (p × Tparallel(p))
Strong Scaling Efficiency and Amdahl's Law
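Combining the previous two formulas (a derivation added here, not reproduced from the slide) sketches the relation:

\[ E(p) = \frac{S(p)}{p} = \frac{1}{\alpha p + (1 - \alpha)} \]

so for any fixed serial fraction α > 0, the strong scaling efficiency decreases monotonically towards 0 as p grows.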
Weak Scaling Efficiency
● W(p) = Weak Scaling Efficiency
● where
– T1 = time required to complete 1 work unit with 1 processor
– Tp = time required to complete p work units with p processors
W(p) = T1 / Tp
Example
● See omp-matmul.c
– demo-strong-scaling.sh computes the times required for strong scaling (demo-strong-scaling.ods)
– demo-weak-scaling.sh computes the times required for weak scaling (demo-weak-scaling.ods)
● Important note
– For a given n, the amount of "work" required to compute the product of two n × n matrices is ~ n³
– To double the total work, do not double n!
● The amount of work required to compute the product of two (2n × 2n) matrices is ~ 8n³, eight times the n × n case
– To double the work, we must use matrices of size (∛2 n × ∛2 n)
– In general, with p processes we use matrices of size (∛p n × ∛p n)
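A small sketch (not from the slides) of how the matrix size might be scaled in a weak-scaling experiment for omp-matmul.c; the base size n0 is a hypothetical value.

#include <stdio.h>
#include <math.h>

int main(void)
{
    const int n0 = 1000;                     /* hypothetical base matrix size for p = 1 */
    for (int p = 1; p <= 8; p *= 2) {
        const int n = (int)(n0 * cbrt(p));   /* matrix side grows with the cube root of p ... */
        printf("p = %d -> n = %d\n", p, n);  /* ... so the ~n^3 work grows linearly with p    */
    }
    return 0;
}

(Compile with -lm for cbrt().)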