An Introduction to Parallel Programming
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna
http://www.moreno.marzolla.name/
About this course
● Moreno Marzolla
– http://www.moreno.marzolla.name/
– Research interests: parallel programming; modeling and simulation; software engineering
● This course
– “Bird's eye view” of parallel programming
– Emphasis on practical aspects instead of theoretical foundations
– OpenMP and CUDA
● Final exam
– Short paper (≥ 4 pages) describing a small programming project and/or an HPC algorithm/application of your interest
– Please discuss the paper's topic with the instructor before starting to work on it
Before we start
● Try to access isi-raptor03.csr.unibo.it
– Username: your full email address first.lastname@unibo.it
– Password: [your UniBO password]
● From a Unix/Linux client:
ssh first.lastname\@unibo.it@isi-raptor03.csr.unibo.it
Moore's Law
"The number of transistors on an IC doubles every 24 months"
● That used to mean that every new generation of processors was based on smaller transistors
Moore, G.E., "Cramming more components onto integrated circuits". Electronics, 38(8), April 1965
Gordon E. Moore (1929–)
[Figure: microprocessor transistor counts over time. By Wgsimon, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542]
Physics lesson
● Smaller transistors → Faster processor
● Faster processor → Higher power consumption
● Higher power consumption → More heat produced
● More heat produced → Unreliable processor
Power
● The power required by an IC (e.g., a processor) can be expressed as
Power = C × V² × f, where:
– C is the capacitance (ability of a circuit to store energy)
– V is the voltage
– f is the frequency at which the processor operates
Power
[Figure: a single processor running at frequency f vs. two processors each running at frequency f/2]
Single processor:       capacitance C,     voltage V,     frequency f      →  Power = C V² f
Two processors at f/2:  capacitance 2.2 C, voltage 0.6 V, frequency 0.5 f  →  Power = 0.396 C V² f
Credits: Tim Mattson
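Working the numbers through the power formula confirms the value in the figure (arithmetic added here, not on the original slide):

\[ \mathrm{Power}_{\mathrm{dual}} = 2.2\,C \times (0.6\,V)^2 \times 0.5\,f = (2.2 \times 0.36 \times 0.5)\, C V^2 f = 0.396\, C V^2 f \]

i.e., roughly 40% of the original power for, in principle, the same aggregate instruction throughput.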
What happens today?
● HW designers create processors with more cores
● Result:
– parallel hardware is ubiquitous
– parallel software is rare
● The challenge
– Make parallel software as common as parallel hardware
NVidia Tegra 4 SoC
Applications:
Numerical Wind Tunnel
Source: http://ecomodder.com/forum/showthread.php/random-wind-tunnel-smoke-pictures-thread-26678-12.html
Applications:
Molecular dynamics
Applications:
Cosmological Simulation
Bolshoi simulation https://vimeo.com/29769051
The Bolshoi Simulation recreates the large-scale structure of the universe;
it required 6 million CPU hours on NASA's Pleiades Supercomputer
Source: https://www.nas.nasa.gov/hecc/resources/pleiades.html
Example
Sum-reduction of an array
("Hello, world!" of parallel programming)
Sum-Reduction
● We start by assuming a shared-memory architecture
– All execution units share a common memory space
● We begin with a sequential solution and parallelize it
– This is not always a good idea; some parallel algorithms have nothing in common with their sequential counterparts!
– However, it is sometimes a reasonable starting point
Credits: Mary Hall
Sequential algorithm
● Compute the sum of the content of an array A of length n
float seq_sum(float* A, int n)
{
    int i;
    float sum = 0.0;
    for (i=0; i<n; i++) {
        sum += A[i];
    }
    return sum;
}
Credits: Mary Hall
Version 1 (Wrong!)
● Assuming P execution units (e.g., processors), each one computes a partial sum of n / P adjacent elements
● Example: n = 15, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    sum += my_x;
}
[Figure: the 15-element array is partitioned into three blocks of 5 adjacent elements, one per processor]
WRONG — race condition: all processors update the shared variable sum concurrently, so updates can be lost.
Variables whose names start with my_ are assumed to be local (private) to each processor; all other variables are assumed to be global (shared).
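A minimal sketch (not part of the original slides) of the race condition using POSIX threads; the array contents, N, P and the helper names are illustrative assumptions. All threads update the shared variable sum without synchronization, so updates are frequently lost and the printed result is usually smaller than expected.

#include <stdio.h>
#include <pthread.h>

#define N 1000000
#define P 4

float A[N];        /* shared input array */
float sum = 0.0f;  /* shared accumulator: the source of the race */

void* worker(void* arg)
{
    const int my_id = (int)(long)arg;
    const int my_block_len = N / P;
    const int my_start = my_id * my_block_len;
    const int my_end = my_start + my_block_len;
    for (int my_i = my_start; my_i < my_end; my_i++) {
        sum += A[my_i];   /* unsynchronized read-modify-write */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[P];
    for (int i = 0; i < N; i++) A[i] = 1.0f;
    for (int i = 0; i < P; i++) pthread_create(&tid[i], NULL, worker, (void*)(long)i);
    for (int i = 0; i < P; i++) pthread_join(tid[i], NULL);
    printf("sum = %f (expected %f)\n", sum, (float)N);
    return 0;
}

Compiled with gcc -pthread, most runs print a value smaller than 1000000 because of the lost updates.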
Version 1 (better, but still wrong)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 15, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}
WRONG
Version 1 (better, but still wrong)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 17, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}
WRONG — with n = 17 and P = 3, n/P = 5, so each processor handles only 5 elements and the last 2 elements of the array are not assigned to any processor.
Version 1 (correct, but not efficient)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 17, P = 3
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}
[Figure: with n = 17 and P = 3 the blocks are [0, 5), [5, 11), [11, 17) — every element is assigned to exactly one processor]
Version 2
● Too much contention on mutex m
– Each processor acquires and releases the mutex for each element of the array!
● Solution: increase the mutex granularity
– Each processor accumulates the partial sum on a local (private) variable
– The mutex is used at the end to update the global sum
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; my_sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    my_sum += my_x;
}
mutex_lock(&m);
sum += my_sum;
mutex_unlock(&m);
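For reference, OpenMP (covered later in the course) expresses this same pattern with a reduction clause; the sketch below is an assumption about how one might write it, not taken from the slides. Each thread accumulates into a private copy of sum, and the private copies are combined once at the end of the loop — the same idea as Version 2.

#include <omp.h>

float omp_sum(const float* A, int n)
{
    float sum = 0.0f;
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += A[i];   /* each thread updates its own private copy of sum */
    }
    return sum;        /* the private copies have been combined here */
}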
Version 3: Remove the mutex (wrong, in a subtle way)
● We use a shared array psum[] where each processor can store its local sum
● At the end, one processor computes the global sum
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0; /* all elements set to 0.0 */
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
if ( 0 == my_id ) { /* only the master executes this */
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}
WRONG
The problem with version 3
● Processor 0 could start the computation of the global sum before all other processors have computed the local sums!
[Figure: timeline with four processors P0–P3 — P0 starts computing the global sum while P1–P3 are still computing their local sums]
Version 4 (correct)
● Use a barrier synchronization
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
barrier();
if ( 0 == my_id ) {
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}
[Figure: timeline with four processors P0–P3 — all processors compute their local sums, synchronize at barrier(), then P0 computes the global sum]
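A possible concrete rendering of Version 4 with OpenMP (a sketch under assumptions: the slides use pseudocode and do not prescribe this API; the array contents, N and P are placeholders). The #pragma omp barrier plays the role of barrier().

#include <stdio.h>
#include <omp.h>

#define N 1000000
#define P 4

float A[N], psum[P];

int main(void)
{
    float sum = 0.0f;
    for (int i = 0; i < N; i++) A[i] = 1.0f;

#pragma omp parallel num_threads(P)
    {
        const int my_id = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();   /* may be fewer than P */
        const int my_start = N * my_id / nthreads;
        const int my_end = N * (my_id + 1) / nthreads;
        psum[my_id] = 0.0f;
        for (int my_i = my_start; my_i < my_end; my_i++)
            psum[my_id] += A[my_i];
#pragma omp barrier                      /* wait until every psum[] is ready */
        if (0 == my_id) {                /* only the master computes the global sum */
            for (int i = 0; i < nthreads; i++)
                sum += psum[i];
        }
    }
    printf("sum = %f (expected %f)\n", sum, (float)N);
    return 0;
}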
Version 5: Distributed-memory version
● P << n processors
● Each processor computes a local sum
● Each processor sends its local sum to processor 0 (the master)
my_sum = 0.0;
my_start = …; my_end = …;
for ( i = my_start; i < my_end; i++ ) {
    my_sum += get_value(i);
}
if ( 0 == my_id ) {
    for ( i=1; i<P; i++ ) {
        tmp = receive from proc i;
        my_sum += tmp;
    }
    printf("The sum is %f\n", my_sum);
} else {
    send my_sum to proc 0;
}
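The pseudocode above does not name a concrete message-passing API. A minimal MPI rendering (an assumption — MPI is not introduced on these slides) could look as follows; each process is supposed to have already computed my_sum over its own portion of the data (omitted here).

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int my_id, P;
    float my_sum = 0.0f, tmp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* ... each process computes my_sum over its own block of data ... */

    if (0 == my_id) {
        for (int i = 1; i < P; i++) {
            MPI_Recv(&tmp, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            my_sum += tmp;
        }
        printf("The sum is %f\n", my_sum);
    } else {
        MPI_Send(&my_sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}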
Version 5
[Figure: eight processors; each holds a local sum of its portion of A[] (1, 3, -2, 7, -6, 5, 3, 4). Processor 0 receives the other local sums one at a time and accumulates them: 4, 2, 9, 3, 8, 11, 15.]
Processor 0 is a bottleneck: it receives and adds all P − 1 partial sums sequentially.
Parallel reduction
[Figure: tree-structured reduction across eight processors — the local sums 1, 3, -2, 7, -6, 5, 3, 4 are added pairwise (4, 5, -1, 7), then again (9, 6), and processor 0 obtains the final result 15.]
● (P – 1) sums are still performed; however, processor 0 receives ~ log₂ P messages and performs ~ log₂ P sums
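In practice the whole combination step can be delegated to a library call; MPI_Reduce, for instance, typically uses a tree-structured algorithm like the one sketched above (again an assumption, since the slides stay at the pseudocode level). It would be a drop-in replacement for the if/else block in the earlier MPI sketch:

/* Each process passes its local sum; only the root (process 0) receives the total. */
float sum = 0.0f;
MPI_Reduce(&my_sum, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
if (0 == my_id) {
    printf("The sum is %f\n", sum);
}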
Parallel architectures
Flynn's Taxonomy
● SISD — Single Instruction Stream, Single Data Stream (the Von Neumann architecture)
● SIMD — Single Instruction Stream, Multiple Data Streams
● MISD — Multiple Instruction Streams, Single Data Stream
● MIMD — Multiple Instruction Streams, Multiple Data Streams
Single Instruction Multiple Data
● SIMD instructions apply the same operation (e.g., sum, product, …) to multiple elements (typically 4 or 8, depending on the width of the SIMD registers and the data types of the operands)
– This means that there must be 4/8/... independent ALUs
[Figure: four SIMD lanes execute LOAD A[i], LOAD B[i], C[i] = A[i] + B[i], STORE C[i] for i = 0..3 simultaneously, one instruction per time step.]
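As a concrete illustration (not from the slides, and assuming an x86 processor with SSE support), the four simultaneous additions in the figure can be written with compiler intrinsics; each 128-bit register holds four floats.

#include <immintrin.h>

/* C[0..3] = A[0..3] + B[0..3] with a single SIMD addition */
void add4(const float* A, const float* B, float* C)
{
    __m128 a = _mm_loadu_ps(A);    /* LOAD A[0..3] */
    __m128 b = _mm_loadu_ps(B);    /* LOAD B[0..3] */
    __m128 c = _mm_add_ps(a, b);   /* four sums computed by one instruction */
    _mm_storeu_ps(C, c);           /* STORE C[0..3] */
}

In many cases the compiler can generate equivalent code automatically (auto-vectorization) from the plain scalar loop.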
GPU
Fermi GPU die (source: http://www.legitreviews.com/article/1100/1/)
● Modern GPUs (Graphics Processing Units) can be regarded as a special kind of SIMD architecture
MIMD
● In MIMD systems there are multiple execution units that can execute multiple sequences of instructions
– Multiple Instruction Streams
● Each execution unit might operate on a different input
– Multiple Data Streams
[Figure: each execution unit runs its own instruction stream on its own data — e.g., one unit executes z = x + y; CALL F(); z = 8; y = 1.7, another a = 18; b = 9; if (a > b) c = 7; a = a - 1, another w = 7; t = 13; k = G(w,t); k = k + 1, and another the LOAD/ADD/STORE sequence of the SIMD example.]
MIMD architectures
● Shared Memory
– A set of processors sharing a common memory space
– Each processor can access any memory location
● Distributed Memory
– A set of compute nodes connected through an interconnection network
● The simplest example: a cluster of PCs connected via Ethernet
– Nodes must communicate explicitly to share data
[Figure: shared memory — several CPUs connected to a single memory through an interconnect; distributed memory — several nodes, each with a CPU and its own local memory, connected through an interconnect.]
Hybrid architectures
● Many HPC systems are based on hybrid architectures
– Each compute node is a shared-memory multiprocessor
– Many compute nodes are connected through a network
[Figure: several compute nodes, each containing multiple CPUs with shared memory and GPUs, connected through an interconnect.]
Recap
Shared memory
● Advantages:
– Easy to program
– Useful for applications with irregular data access patterns (e.g., graph algorithms)
● Disadvantages:
– The programmer must take care of race conditions
– Limited memory bandwidth
Distributed memory
● Advantages:
– Scalable: provides high computational power by adding more nodes
– Useful for applications with strong locality of reference and a high computation / communication ratio
● Disadvantages:
– Latency of interconnect network
– Difficult to program
Efficiency and scalability of parallel programs
Scalability
● How much faster can a given problem be solved with p workers instead of one?
● How much more work can be done with p workers instead of one?
● What impact do the communication requirements of the parallel application have on performance?
● What fraction of the resources is actually used productively for solving the problem?
Credits: David Padua
Speedup
● Let us define:
– p = Number of processors / cores
– Tserial = Execution time of the serial program
– Tparallel(p) = Execution time of the parallel program with p processors / cores
Speedup
● Speedup S(p)
● In the ideal case, the parallel program requires 1/p the time of the sequential program
● S(p) = p is the ideal case of linear speedup
– Realistically, S(p) ≤ p
– Is it possible to observe S(p) > p ?
S(p) = Tserial / Tparallel(p)
Usually Tserial = Tparallel(1)
Non-parallelizable portions
● Suppose that a fraction α of the total execution time of the serial program can not be parallelized
– E.g., due to:
● Algorithmic limitations (data dependencies)
● Bottlenecks (e.g., shared resources)
● Startup overhead
● Communication costs
● Suppose that the remaining fraction (1 - α) can be fully parallelized
● Then, we have:
Tparallel(p) = α Tserial + (1 − α) Tserial / p
Example
[Figure: execution-time bars for p = 1, 2, 4. The intrinsically serial part (α Tserial) stays constant, while the parallelizable part ((1 − α) Tserial) shrinks to (1 − α) Tserial / p, so Tparallel(p) = α Tserial + (1 − α) Tserial / p.]
Example
● Suppose that a program has Tserial = 20s
● Assume that 10% of the time is spent in a serial portion of the program
● Therefore, the execution time of a parallel version with p processors is
Tparallel(p) = 0.1 Tserial + 0.9 Tserial / p = 2 + 18 / p
Example (cont.)
● The speedup is
● What is the maximum speedup that can be achieved when p → +∞ ?
S(p) = Tserial / (0.1 Tserial + 0.9 Tserial / p) = 20 / (2 + 18 / p)
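For instance (numbers worked out here, not on the slide):

\[ S(2) = \frac{20}{2 + 9} \approx 1.82, \qquad S(4) = \frac{20}{2 + 4.5} \approx 3.08, \qquad \lim_{p \to \infty} S(p) = \frac{20}{2} = 10 \]

so no matter how many processors are used, the speedup cannot exceed 10.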
Amdahl's Law
● What is the maximum speedup?
S(p) = Tserial / Tparallel(p) = Tserial / (α Tserial + (1 − α) Tserial / p) = 1 / (α + (1 − α) / p)
Gene Myron Amdahl (1922–)
Amdahl's Law
● From S(p) = 1 / (α + (1 − α) / p) we get an asymptotic speedup of 1/α as p grows to infinity
If a fraction α of the total execution time is spent on a serial portion of the program, then the maximum achievable speedup is 1/α
Scaling Efficiency
● Strong Scaling: increase the number of processors p keeping the total problem size fixed
– The total amount of work remains constant
– The amount of work for a single processor decreases as p increases
– Goal: reduce the total execution time by adding more processors
● Weak Scaling: increase the number of processors p keeping the per-processor work fixed
– The total amount of work grows as p increases
– Goal: solve larger problems within the same amount of time
Strong Scaling Efficiency
● E(p) = Strong Scaling Efficiency
● where
– Tserial = Execution time of the serial program
– Tparallel(p) = Execution time of the parallel program with p processors / cores
E(p) = S(p) / p = Tserial / (p × Tparallel(p))
Strong Scaling Efficiency and Amdahl's Law
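Combining the previous two formulas (a derivation added here, not reproduced from the slide) sketches the relation:

\[ E(p) = \frac{S(p)}{p} = \frac{1}{\alpha p + (1 - \alpha)} \]

so for any fixed serial fraction α > 0, the strong scaling efficiency decreases monotonically towards 0 as p grows.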
Weak Scaling Efficiency
● W(p) = Weak Scaling Efficiency
● where
– T1 = time required to complete 1 work unit with 1 processor
– Tp = time required to complete p work units with p processors
W(p) = T1 / Tp
Example
● See omp-matmul.c
– demo-strong-scaling.sh computes the times required for strong scaling (demo-strong-scaling.ods)
– demo-weak-scaling.sh computes the times required for weak scaling (demo-weak-scaling.ods)
● Important note
– For a given n, the amount of "work" required to compute the product of two n × n matrices is ~ n³
– To double the total work, do not double n!
● The amount of work required to compute the product of two (2n × 2n) matrices is ~ 8n³, eight times the n × n case
– To double the work, we must use matrices of size (∛2 n × ∛2 n)
– In general, with p processes we use matrices of size (∛p n × ∛p n)
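A small sketch (not from the slides) of how the matrix size might be scaled in a weak-scaling experiment for omp-matmul.c; the base size n0 is a hypothetical value.

#include <stdio.h>
#include <math.h>

int main(void)
{
    const int n0 = 1000;                     /* hypothetical base matrix size for p = 1 */
    for (int p = 1; p <= 8; p *= 2) {
        const int n = (int)(n0 * cbrt(p));   /* matrix side grows with the cube root of p ... */
        printf("p = %d -> n = %d\n", p, n);  /* ... so the ~n^3 work grows linearly with p    */
    }
    return 0;
}

(Compile with -lm for cbrt().)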