Architectures and Platforms for AI module 2
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna
http://www.moreno.marzolla.name/
Pacheco, Chapter 1
High Performance Computing 2
Credits
● prof. Salvatore Orlando, Univ. Ca' Foscari di Venezia http://www.dsi.unive.it/~orlando/
● prof. Mary Hall, University of Utah https://www.cs.utah.edu/~mhall/
● Tim Mattson, Intel
Who I am
● Moreno Marzolla
  – Associate professor @ DISI (Cesena campus)
  – https://www.moreno.marzolla.name/
● Recent teaching activity
  – High Performance Computing @ ISI
  – Foundations of Computer Science
  – Algorithms and Data Structures
● Research activity
  – Parallel programming
  – Modeling and simulation
Architectures and Platforms for AI module 2
● Web page
  – https://www.moreno.marzolla.name/teaching/APAI
● Schedule
  – Monday 11:00–13:30, Room 0.6
  – Check the Web page and the official course timetable for variations
● Office hours
  – On demand via Teams; please send e-mail
Syllabus
● 2 CFU (~24 hours of lectures)
● Theory
  – Parallel programming patterns
  – Performance evaluation of parallel programs
● Parallel programming practice
  – Shared-memory programming with C/OpenMP
  – GPU programming with CUDA/C
References
● Peter S. Pacheco, An Introduction to Parallel Programming, Morgan Kaufmann 2011, ISBN 9780123742605
  – Theory + C/OpenMP programming
● CUDA C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/
  – CUDA/C programming
● See the Web page for slides and links to online material: https://www.moreno.marzolla.name/
Prerequisites
● Mandatory
  – You must be good at C programming
● Strongly suggested
  – (Foundations of) Operating Systems
  – (Foundations of) Computer Architectures
  – (Foundations of) Algorithms and Data Structures
Hardware resources
● isi-raptor03.csr.unibo.it
  – Dual-socket Xeon, 12 cores, 64 GB RAM, Ubuntu 16.04
  – 3× NVidia GeForce GTX 1070
Exam
● Individual programming project + written report
  – Specification defined by the instructor
  – One project for the whole academic year
  – The project can be handed in at any time
  – There is no oral discussion, unless I need explanations
● Final grade
  – Weighted average of modules 1+3 (weight 4) and module 2 (weight 2)
  – Rounded to the nearest integer
  – Honors ("lode") assigned by the instructors for outstanding work
● Grades remain valid until Sep 30, 2021
  – After that, a new edition of this course starts
Grading the programming project
● Correctness
● Clarity
● Efficiency
● Quality of the written report
  – Proper grammar, syntax, ...
  – Technical correctness
  – Performance evaluation
Questions?
Intro to parallel programming
High Performance Computing
● Many applications need considerable computing power
  – Weather forecasting, climate modeling, physics simulation, product engineering, 3D animation, finance, ...
● Why?
  – To solve more complex problems
  – To solve the same problem in less time
  – To solve the same problem more accurately
  – To make better use of available computing resources
Parallel programming
● "Traditional" scientific paradigm
  – Make a theory, then experiment
● "Traditional" engineering paradigm
  – Design, then build
● Enter numerical experimentation and prototyping
  – Some phenomena are too complex to be modeled accurately (e.g., weather forecasting)
  – Some experiments are too complex, costly, dangerous, or impossible to do in the lab (e.g., wind tunnels, seismic simulations, stellar dynamics, ...)
● Computational science
  – Numerical simulations are becoming a new way to "do science"
Slide credits: S. Orlando
Applications: Neural Networks
Applications: Molecular dynamics
Intro to Parallel Programming 18
Applications: Cosmological Simulation
● Bolshoi simulation, https://vimeo.com/29769051
  – The Bolshoi Simulation recreates the large-scale structure of the universe; it required 6 million CPU hours on NASA's Pleiades supercomputer
  – Source: https://www.nas.nasa.gov/hecc/resources/pleiades.html
Moore's Law
"The number of transistors on an IC doubles every 24 months"
● That used to mean that every new generation of processors was based on smaller transistors
Moore, G.E., Cramming more components onto integrated circuits. Electronics, 38(8), April 1965
Gordon E. Moore (1929– )
[Chart: transistor count over time, log scale. By Wgsimon, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542]
Physics lesson
● Smaller transistors → Faster processor
● Faster processor → Higher power consumption
● Higher power consumption → More heat produced
● More heat produced → Unreliable processor
Power
● The power required by an IC (e.g., a processor) can be expressed as
  Power = C × V² × f
  where:
  – C is the capacitance (ability of a circuit to store energy)
  – V is the voltage
  – f is the frequency at which the processor operates
Power
● Single processor: capacitance C, voltage V, frequency f
  Power = C V² f
● Two processors, each running at frequency f / 2: capacitance 2.2 C, voltage 0.6 V, frequency 0.5 f
  Power = 2.2 C × (0.6 V)² × 0.5 f = 0.396 C V² f
Credits: Tim Mattson
[Chart: 42 years of microprocessor trend data. Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/]
Processor/Memory wall
Source: John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, Fifth Ed., Morgan Kaufmann 2012, ISBN 978-0-12-383872-8
Limits
● There are limits to the "automatic" improvement of scalar performance:
  – The Power Wall: clock frequency cannot be increased without exceeding the limits of air cooling
  – The Memory Wall: access to data is a limiting factor
  – The ILP Wall: all the existing instruction-level parallelism (ILP) is already being used
● Conclusion: explicit parallel mechanisms and explicit parallel programming are required for performance scaling
Slide credits: Hebenstreit, Reinders, Robison, McCool, SC13 tutorial
What happens today?
● HW designers create processors with more cores
● Result:
  – parallel hardware is ubiquitous
  – parallel software is rare
● The challenge: make parallel software as common as parallel hardware
[Pictured: NVidia Tegra 4 SoC]
Parallel programming in brief
● Decompose the problem into sub-problems
● Distribute sub-problems to the available execution units
● Solve sub-problems independently
  – Cooperate to solve sub-problems
● Goals
  – Reduce the wall-clock time
  – Balance the workload across execution units
  – Reduce communication and synchronization overhead
Slide credits: S. Orlando
Concurrency vs Parallelism
Slide credits: Tim Mattson, Intel
[Diagram: three tasks (Task 1–3) interleaved on a single execution unit (concurrency without parallelism) vs. executed simultaneously on separate units (parallelism)]
The “Holy Grail”
● Write serial code and have a "smart" compiler capable of parallelizing programs automatically
● It has been done in some very specific cases
  – In practice, no compiler has proved to be "smart" enough
● Writing efficient parallel code requires that the programmer understands, and makes explicit use of, the underlying hardware
Here we come
Parallel programming is difficult
● Serial version: ~49 lines of C++ code
● Parallel version: ~1000 lines of C/C++ code
http://www.moreno.marzolla.name/software/svmcell/
Issues of parallel programming
● Writing parallel programs is in general much harder than writing sequential code
● There is limited portability across different types of architectures
  – E.g., a distributed-memory parallel program must be rewritten from scratch to run on a GPU
  – However, there are standards (OpenMP, MPI, OpenCL) that allow portability across the same type of parallel architecture
● Tuning for best performance is time-consuming
Example
Sum-reduction of an array
("Hello, world!" of parallel programming)
Sum-Reduction
● We start by assuming a shared-memory architecture
  – All execution units share a common memory space
● We begin with a sequential solution and parallelize it
  – This is not always a good idea; some parallel algorithms have nothing in common with their sequential counterparts!
  – However, it is sometimes a reasonable starting point
Credits: Mary Hall
Sequential algorithm
● Compute the sum of the content of an array A of length n

float seq_sum(float* A, int n) {
    int i;
    float sum = 0.0;
    for (i=0; i<n; i++) {
        sum += A[i];
    }
    return sum;
}
Credits: Mary Hall
Version 1 (Wrong!)
● Assuming P execution units (e.g., processors), each one computes a partial sum of n / P adjacent elements
● Example: n = 15, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    sum += my_x;
}

WRONG: race condition on the shared variable sum!
Variables whose names start with my_ are assumed to be local (private) to each processor; all other variables are assumed to be global (shared).
Version 1 (better, but still wrong)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 15, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}

STILL WRONG
Version 1 (better, but still wrong)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 17, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}

WRONG: when n is not a multiple of P (here n/P = 5), the last n mod P elements are not assigned to any processor
Version 1 (correct, but not efficient)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 17, P = 3

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}
Version 2
● Too much contention on mutex m
  – Each processor acquires and releases the mutex for each element of the array!
● Solution: increase the mutex granularity
  – Each processor accumulates the partial sum in a local (private) variable
  – The mutex is used only at the end to update the global sum

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; my_sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    my_sum += my_x;
}
mutex_lock(&m);
sum += my_sum;
mutex_unlock(&m);
Version 3: remove the mutex (wrong, in a subtle way)
● We use a shared array psum[] where each processor can store its local sum
● At the end, one processor computes the global sum

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0; /* all elements set to 0.0 */
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
if ( 0 == my_id ) { /* only the master executes this */
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}

WRONG
The problem with version 3
● Processor 0 could start the computation of the global sum before all other processors have computed their local sums!
[Diagram: P0 begins the global sum while P1–P3 are still computing their local sums]
Version 4 (correct)
● Use a barrier synchronization

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
barrier();
if ( 0 == my_id ) {
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}

[Diagram: all processors compute their local sums, wait at barrier(), then P0 computes the global sum]
Version 5
Distributed-memory version
● P << n processors
● Each processor computes a local sum
● Each processor sends its local sum to processor 0 (the master)

my_sum = 0.0;
my_start = …; my_end = …;
for ( i = my_start; i < my_end; i++ ) {
    my_sum += get_value(i);
}
if ( 0 == my_id ) {
    for ( i=1; i<P; i++ ) {
        tmp = receive from proc i;
        my_sum += tmp;
    }
    printf("The sum is %f\n", my_sum);
} else {
    send my_sum to proc 0;
}
Version 5
[Diagram: with A[] = {1, 3, -2, 7, -6, 5, 3, 4}, processors 1–7 each send their local sum to processor 0, which adds them one at a time; processor 0 is a bottleneck]
Parallel reduction
[Diagram: tree-structured reduction of A[] = {1, 3, -2, 7, -6, 5, 3, 4} across 8 processors; partial sums are combined pairwise at each step until processor 0 holds the total (15)]
● (P – 1) sums are still performed; however, processor 0 receives ~log2 P messages and performs ~log2 P sums
Task parallelism vs Data parallelism
● Task Parallelism
  – Distribute (possibly different) tasks to processors
● Data Parallelism
  – Distribute data to processors
  – Each processor executes the same task on different data
Example
● We have a table containing hourly temperatures at some location
  – 24 columns, 365 rows
● Compute the minimum, maximum and average temperature for each day
● Assume we have 3 independent processors
Example
[Diagram: a table with 365 rows (days 0–364) and 24 columns (hours 0–23), plus three per-day output columns: max, min, ave]
Data parallel approach
[Diagram: data-parallel decomposition: the 365 rows are split into three blocks of days; each processor computes max, min and ave for its own block]
Task parallel approach
[Diagram: task-parallel decomposition: Proc 0 computes max, Proc 1 computes min, Proc 2 computes ave, each over all 365 days]
Key concepts
● Parallel architectures "naturally" derive from the laws of physics
● Parallel architectures require parallel programming paradigms