High Performance Computing
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna
[email protected]
Pacheco, Chapter 1
High Performance Computing 2
Copyright © 2013—2021
Moreno Marzolla, Università di Bologna, Italy
https://www.moreno.marzolla.name/teaching/HPC/
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit
https://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons,
543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
High Performance Computing 3
Credits
●
prof. Salvatore Orlando, Univ. Ca' Foscari di Venezia http://www.dsi.unive.it/~orlando/
●
prof. Mary Hall, University of Utah https://www.cs.utah.edu/~mhall/
●
Tim Mattson, Intel
High Performance Computing 4
Who I am
●
Moreno Marzolla
– Associate professor @ DISI
– https://www.moreno.marzolla.name/
●
Current and past teaching activity
– High Performance Computing
– Lab Algorithms and Data Structures
– Past: Algorithms and Data Structures; Complex Systems;
Software Engineering; Introduction to Programming
●
Research activity
– Parallel programming
– Modeling and simulation
High Performance Computing 5
High Performance Computing
●
Web page
– https://www.moreno.marzolla.name/teaching/HPC/
●
Schedule
– Monday 09:00–12:00 room 2.4
– Tuesday 15:00–18:00 lab 2.2 (First week only)
– Wednesday 15:00–18:00 lab 2.2 (From the 2nd week)
– Check the Web page and the official timetable for last-minute changes
●
Office hours
– Online (Teams); please send e-mail
High Performance Computing 6
References
●
Peter S. Pacheco, An Introduction to Parallel Programming, Morgan
Kaufmann 2011, ISBN 9780123742605
– Theory + OpenMP/MPI programming
●
CUDA C programming guide
http://docs.nvidia.com/cuda/cuda-c-programming-guide/
– CUDA/C programming
●
See the Web page for slides and links to online material
https://www.moreno.marzolla.name/
High Performance Computing 7
Prerequisites
High Performance Computing builds on:
– Programming
– Algorithms and Data Structures
– Computer Architectures
– Operating Systems
– …
High Performance Computing 8
Syllabus
●
6 CFU (~ 60 hours of lectures/lab)
– Lectures ~40 hours
– Lab sessions ~20 hours
●
Theory (first ~3 weeks)
– Parallel architectures
– Parallel programming patterns
– Performance evaluation of parallel programs
●
Parallel programming (rest of the course)
– Shared-memory programming with C/OpenMP
– Distributed-memory programming with C/MPI
– GPU programming with CUDA/C
– SIMD programming (if there is enough time)
High Performance Computing 9
Lab sessions
●
Hands-on programming exercises
●
We will work under Linux only
●
Why? All systems in the TOP500 list of the world's fastest supercomputers run Linux
(List updated June 2021; source: http://www.top500.org)
High Performance Computing 10
Hardware resources
●
isi-raptor03.csr.unibo.it
– Dual socket Xeon, 12 cores, 64 GB RAM, Ubuntu 16.04
– 3x NVidia GeForce GTX 1070
High Performance Computing 11
Exam
●
Written exam (weight: 40%)
– Questions/simple exercises on all topics discussed during the course (sample exams are available on the Web page)
– 6 dates: 2 in the winter term (Jan/Feb 2022); 3 in the summer term (Jun/Jul 2022); 1 in the fall term (Sep 2022)
– You can refuse the grade and retry the written exam
●
Individual programming project + written report (weight: 60%)
– Specification defined by the instructor; one project for each academic year
– The project can be handed in at any time before September 30, 2022
– There is no oral discussion, unless I request it
– If you refuse the grade, you must hand in a NEW project on NEW specifications
●
Final grade rounded to the nearest integer
●
The written exam and programming project are independent, and can be completed in any order
●
Grades are valid until September 30, 2022
– After that, a new academic year starts
– There is no guarantee that the instructor and/or the type of exam will remain the same
High Performance Computing 12
Grading the programming project
●
Correctness
●
Clarity
●
Efficiency
●
Quality of the written report
– Proper grammar, syntax, ...
– Technical correctness
– Performance evaluation
High Performance Computing 13
No cheating
High Performance Computing 14
Questions?
High Performance Computing 15
Intro to parallel programming
High Performance Computing 16
High Performance Computing
●
Many applications need considerable computing power
– Weather forecast, climate modeling, physics simulation, product engineering, 3D animation, finance, ...
●
Why?
– To solve more complex problems
– To solve the same problem in less time
– To solve the same problem more accurately
– To make better use of available computing resources
Intro to Parallel Programming 17
Applications:
Numerical Wind Tunnel
Source: http://ecomodder.com/forum/showthread.php/random-wind-tunnel-smoke-pictures-thread-26678-12.html
Intro to Parallel Programming 18
Applications:
Molecular dynamics
Source: http://bionano.physics.illinois.edu/node/109
Intro to Parallel Programming 19
Applications: Global Climate models
Source: Chapter 1 of Climate Change 2013: The Physical Science Basis https://www.ipcc.ch/report/ar5/wg1/
Intro to Parallel Programming 20
Applications:
Cosmological Simulation
Bolshoi simulation
https://www.youtube.com/watch?v=gmcvFmlkYjY
The Bolshoi Simulation recreates the large-scale structure of the universe;
it required 6 million CPU hours on NASA's Pleiades Supercomputer
Source: https://www.nas.nasa.gov/hecc/resources/pleiades.html
Intro to Parallel Programming 21
Moore's Law
"The number of transistors on an IC doubles every 24 months"
●
That used to mean that every new generation of processors was based on smaller transistors
Moore, G.E., "Cramming more components onto integrated circuits", Electronics, 38(8), April 1965
Gordon E. Moore (1929– )
Intro to Parallel Programming 22
Physics lesson
●
Smaller transistors → Faster processor
●
Faster processor → Higher power consumption
●
Higher power consumption → More heat produced
●
More heat produced → Unreliable processor
Intro to Parallel Programming 23
Power
●
The power required by an IC (e.g., a processor) can be expressed as
Power = C × V² × f, where:
– C is the capacitance (ability of a circuit to store energy)
– V is the voltage
– f is the frequency at which the processor operates
Intro to Parallel Programming 24
Power
[Figure: a single processor working at frequency f, with capacitance C and voltage V, consumes Power = C V² f; replacing it with two processors running at frequency f/2 each (total capacitance 2.2 C, voltage 0.6 V) gives Power = 0.396 C V² f]
Credits: Tim Mattson
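Checking the arithmetic of the figure: the two slower processors pay a capacitance penalty but gain on voltage and frequency, since

2.2 × (0.6)² × 0.5 = 2.2 × 0.36 × 0.5 = 0.396

so, assuming the workload can be split across the two processors, the same throughput is delivered at roughly 40% of the original power.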
Intro to Parallel Programming 25
Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/
High Performance Computing 26
Processor/Memory wall
Source: John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, Fifth Ed., Morgan Kaufmann 2012, ISBN 978-0-12-383872-8
Intro to Parallel Programming 27
What happens today?
●
HW designers create processors with more cores
●
Result:
– parallel hardware is ubiquitous
– parallel software is rare
●
The challenge
– Make parallel software as common as parallel hardware
NVidia Tegra 4 SoC
High Performance Computing 28
Parallel programming in brief
●
Decompose the problem into sub-problems
●
Distribute sub-problems to the available execution units
●
Solve sub-problems independently
– Cooperate to solve sub-problems
●
Goals
– Reduce the wall-clock time
– Balance the workload across execution units
– Reduce communication and synchronization overhead
Credits: S. Orlando
High Performance Computing 29
Concurrency vs Parallelism
Credits: Tim Mattson, Intel
[Figure: Tasks 1–3 interleaved on a single execution unit (concurrency without parallelism) vs. executed simultaneously on different units (parallelism)]
High Performance Computing 30
The “Holy Grail”
●
Write serial code and have a “smart” compiler capable of parallelizing programs automatically
●
It has been done in some very specific cases
– In practice, no compiler proved to be “smart” enough
●
Writing efficient parallel code requires that the programmer understands, and makes explicit use of, the underlying hardware
Here we come
High Performance Computing 31
Parallel programming is difficult
Serial version: ~49 lines of C++ code
Parallel version: ~1000 lines of C/C++ code
http://www.moreno.marzolla.name/software/svmcell/
Intro to Parallel Programming 33
Example
Sum-reduction of an array
("Hello, world!" of parallel programming)
Intro to Parallel Programming 34
Sum-Reduction
●
We start assuming a shared-memory architecture
– All execution units share a common memory space
●
We begin with a sequential solution and parallelize it
– This is not always a good idea; some parallel algorithms have nothing in common with their sequential counterparts!
– However, it is sometimes a reasonable starting point
Credits: Mary Hall
Intro to Parallel Programming 35
Sequential algorithm
●
Compute the sum of the content of an array A of length n
float seq_sum(float* A, int n)
{
    int i;
    float sum = 0.0;
    for (i=0; i<n; i++) {
        sum += A[i];
    }
    return sum;
}
Credits: Mary Hall
Intro to Parallel Programming 36
Version 1 (Wrong!)
●
Assuming P execution units (e.g., processors), each one computes a partial sum of n / P adjacent elements
●
Example: n = 15, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    sum += v[my_i];
}
WRONG: all processors update the shared variable sum concurrently (race condition!)
Variables whose names start with my_ are assumed to be local (private) to each processor; all other variables are assumed to be global (shared)
Intro to Parallel Programming 37
Version 1 (better, but still wrong)
●
Assuming P processors, each one computes a partial sum of n / P adjacent elements
●
Example: n = 15, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; /* Suppose this is executed exactly once */
mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    mutex_lock(&m);
    sum += v[my_i];
    mutex_unlock(&m);
}
WRONG
Intro to Parallel Programming 38
Version 1 (better, but still wrong)
●
Assuming P processors, each one computes a partial sum of n / P adjacent elements
●
Example: n = 17, P = 3
[Figure: with n = 17 and P = 3, the block length n/P = 5 leaves the last two elements of v[] assigned to no processor]
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; /* Suppose this is executed exactly once */
mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    mutex_lock(&m);
    sum += v[my_i];
    mutex_unlock(&m);
}
WRONG
Intro to Parallel Programming 39
Version 1
(correct, but not efficient)
●
Assuming P processors, each one computes a partial sum of n / P adjacent elements
●
Example: n = 17, P = 3
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; /* Suppose this is executed exactly once */
mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    mutex_lock(&m);
    sum += v[my_i];
    mutex_unlock(&m);
}
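To see why the new index formulas handle the case in which P does not divide n, here is a small illustrative C program (not part of the original slides) that prints both partitioning schemes for n = 17 and P = 3:

#include <stdio.h>

int main(void)
{
    const int n = 17, P = 3;
    for (int id = 0; id < P; id++) {
        const int len = n / P;          /* truncated: 17/3 = 5 */
        printf("fixed block : proc %d -> [%2d, %2d)\n",
               id, id * len, id * len + len);
        printf("balanced    : proc %d -> [%2d, %2d)\n",
               id, n * id / P, n * (id + 1) / P);
    }
    /* the fixed block length leaves elements 15 and 16 unassigned;
       the balanced split covers [0,5), [5,11), [11,17), i.e. all 17 elements */
    return 0;
}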
Intro to Parallel Programming 40
Version 2
●
Too much contention on mutex m
– Each processor acquires and releases the mutex for each element of the array!
●
Solution: increase the mutex granularity
– Each processor accumulates the partial sum in a local (private) variable
– The mutex is used at the end to update the global sum
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; my_sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_sum += v[my_i];
}
mutex_lock(&m);
sum += my_sum;
mutex_unlock(&m);
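The pseudocode above can be turned into a complete program; the following POSIX Threads sketch is one possible (illustrative) realization, with NTHREADS, N and the worker/main structure chosen for the example rather than taken from the slides. Compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

float v[N];                              /* shared input array  */
float sum = 0.0f;                        /* shared global sum   */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    const int my_id = *(const int *)arg;
    const int my_start = (long)N * my_id / NTHREADS;
    const int my_end   = (long)N * (my_id + 1) / NTHREADS;
    float my_sum = 0.0f;                 /* private partial sum */
    for (int i = my_start; i < my_end; i++)
        my_sum += v[i];
    pthread_mutex_lock(&m);              /* one lock per thread, not per element */
    sum += my_sum;
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int id[NTHREADS];
    for (int i = 0; i < N; i++) v[i] = 1.0f;
    for (int i = 0; i < NTHREADS; i++) {
        id[i] = i;
        pthread_create(&tid[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("sum = %f\n", sum);           /* expected: 1000000.0 */
    return 0;
}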
Intro to Parallel Programming 41
Version 3: Remove the mutex
(wrong, in a subtle way)
●
We use a shared array psum[] where each processor can store its local sum
●
At the end, one processor computes the global sum
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0; /* all elements set to 0.0 */
for (my_i=my_start; my_i<my_end; my_i++) {
    psum[my_id] += v[my_i];
}
if ( 0 == my_id ) { /* only the master executes this */
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}
WRONG
Intro to Parallel Programming 42
The problem with version 3
●
Processor 0 could start the computation of the global sum before all other processors have computed the local sums!
[Figure: timeline in which P0 starts computing the global sum while P1, P2 and P3 are still computing their local sums]
Intro to Parallel Programming 43
Version 4 (correct)
●
Use a barrier synchronization
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    psum[my_id] += v[my_i];
}
barrier();
if ( 0 == my_id ) {
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}
[Figure: P0..P3 compute their local sums, all wait at the barrier(), then P0 computes the global sum]
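For comparison, OpenMP (covered later in the course) can express the whole reduction with a single directive: the reduction(+:sum) clause makes the compiler create the private partial sums and combine them safely, so no explicit mutex or barrier appears in the code. A minimal sketch (compile with gcc -fopenmp):

#include <omp.h>

float omp_sum(const float *A, int n)
{
    float sum = 0.0f;
    /* each thread accumulates a private copy of sum; the
       reduction(+:sum) clause combines the copies at the end */
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += A[i];
    }
    return sum;
}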
Intro to Parallel Programming 44
Version 5
Distributed-memory version
●
P << n processors
●
All variables are private
●
Each processor computes a local sum
●
Each processor sends its local sum to processor 0 (the master)
... my_sum = 0.0;
my_start = …; my_end = …;
for ( i = my_start; i < my_end; i++ ) {
    my_sum += get_value(i);
}
if ( 0 == my_id ) {
    for ( i=1; i<P; i++ ) {
        tmp = receive from proc i;
        my_sum += tmp;
    }
    printf("The sum is %f\n", my_sum);
} else {
    send my_sum to proc 0;
}
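A possible MPI rendering of this scheme (MPI is also covered later in the course) is sketched below; the per-process data is just a placeholder, and the send/receive calls mirror the "receive from proc i" / "send to proc 0" pseudocode above:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int my_id, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* each process computes the sum of its own local data
       (here just a placeholder value) */
    float my_sum = 1.0f * my_id;

    if (0 == my_id) {
        /* the master receives P-1 partial sums, one at a time */
        for (int i = 1; i < P; i++) {
            float tmp;
            MPI_Recv(&tmp, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            my_sum += tmp;
        }
        printf("The sum is %f\n", my_sum);
    } else {
        MPI_Send(&my_sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}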
Intro to Parallel Programming 45
Version 5
[Figure: the values 1, 3, -2, 7, -6, 5, 3, 4 are held by Proc 0..Proc 7; Proc 0 receives the partial sums one at a time and accumulates them sequentially into my_sum, ending with the total 15. Proc 0 is the bottleneck]
Intro to Parallel Programming 46
Parallel reduction
[Figure: tree-structured reduction of the values 1, 3, -2, 7, -6, 5, 3, 4 held by Proc 0..Proc 7: the pairwise sums are 4, 5, -1, 7; the next level gives 9 and 6; Proc 0 obtains the final result 15]
●
(P – 1) sums are still performed; however, processor 0 receives ~log2(P) messages and performs ~log2(P) sums
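Message-passing libraries already provide this tree-structured combination as a collective operation. Continuing the MPI sketch shown earlier, the whole receive loop can be replaced by a single call, which is typically implemented internally with a tree-like communication pattern:

/* assumes MPI has been initialized and my_sum holds this
   process' partial sum, as in the previous sketch */
float sum = 0.0f;
MPI_Reduce(&my_sum, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
/* after the call, only rank 0 holds the valid global sum */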
High Performance Computing 47
Task parallelism vs Data parallelism
●
Task Parallelism
– Distribute (possibly different) tasks to processors
●
Data Parallelism
– Distribute data to processors
– Each processor executes the same task on different data
High Performance Computing 48
Example
●
We have a table containing hourly temperatures at some location
– 24 columns, 365 rows
●
Compute the minimum, maximum and average temperature for each day
●
Assume we have 3 independent processors
High Performance Computing 49
Example
[Figure: a table with days 0–364 as rows and hours 0–23 as columns; for each day (row) the three values max, min and ave must be computed]
High Performance Computing 50
Data parallel approach
[Figure: the rows (days) of the table are partitioned into three contiguous blocks; Proc 0, Proc 1 and Proc 2 each compute max, min and ave for the days in their own block]
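A data-parallel sketch of this computation in C/OpenMP (array names and layout below are illustrative, not from the slides): the rows (days) are distributed among the threads, and every thread executes the same min/max/average loop on its own share of the days. Compile with gcc -fopenmp; without OpenMP the code simply runs serially.

#include <float.h>

#define DAYS  365
#define HOURS 24

/* temp[d][h] = temperature measured at hour h of day d */
void daily_stats(const float temp[DAYS][HOURS],
                 float tmin[DAYS], float tmax[DAYS], float tave[DAYS])
{
    /* data parallelism: the iterations over the days are split
       among the threads; each thread runs the same code on
       different rows of the table */
#pragma omp parallel for
    for (int d = 0; d < DAYS; d++) {
        float lo = FLT_MAX, hi = -FLT_MAX, sum = 0.0f;
        for (int h = 0; h < HOURS; h++) {
            const float t = temp[d][h];
            if (t < lo) lo = t;
            if (t > hi) hi = t;
            sum += t;
        }
        tmin[d] = lo;
        tmax[d] = hi;
        tave[d] = sum / HOURS;
    }
}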
High Performance Computing 51
Task parallel approach
[Figure: each processor computes one of the three results over all the days: Proc 0 computes max, Proc 1 computes min, Proc 2 computes ave]
High Performance Computing 52
Key concepts
●
Parallel architectures “naturally” derive from physics laws
●
Parallel architectures require parallel programming paradigms
●