(1)

High Performance Computing

Moreno Marzolla

Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna

[email protected]

Pacheco, Chapter 1

(2)

High Performance Computing 2

Copyright © 2013—2021

Moreno Marzolla, Università di Bologna, Italy

https://www.moreno.marzolla.name/teaching/HPC/

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit

https://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons,

543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

(3)

High Performance Computing 3

Credits

prof. Salvatore Orlando, Univ. Ca' Foscari di Venezia http://www.dsi.unive.it/~orlando/

prof. Mary Hall, University of Utah https://www.cs.utah.edu/~mhall/

Tim Mattson, Intel

(4)

High Performance Computing 4

Who I am

Moreno Marzolla

– Associate professor @ DISI

– https://www.moreno.marzolla.name/

Current and past teaching activity

– High Performance Computing

– Lab Algorithms and Data Structures

– Past: Algorithms and Data Structures; Complex Systems;

Software Engineering; Introduction to Programming

Research activity

– Parallel programming

– Modeling and simulation

(5)

High Performance Computing 5

High Performance Computing

Web page

– https://www.moreno.marzolla.name/teaching/HPC/

Schedule

– Monday 09:00–12:00 room 2.4

– Tuesday 15:00–18:00 lab 2.2 (First week only)

– Wednesday 15:00–18:00 lab 2.2 (from 2nd week)

– Check the Web page and the official timetable for last-minute changes

Office hours

– Online (Teams); please send e-mail

(6)

High Performance Computing 6

References

Peter S. Pacheco, An Introduction to Parallel Programming, Morgan Kaufmann, 2011, ISBN 9780123742605

– Theory + OpenMP/MPI programming

CUDA C programming guide

http://docs.nvidia.com/cuda/cuda-c-programming-guide/

– CUDA/C programming

See the Web page for slides and links to online material

https://www.moreno.marzolla.name/

(7)

High Performance Computing 7

Prerequisites

Courses that High Performance Computing builds on:

– Programming

– Algorithms and Data Structures

– Computer Architectures

– Operating Systems

– …

(8)

High Performance Computing 8

Syllabus

6 CFU (~ 60 hours of lectures/lab)

– Lectures ~40 hours

– Lab sessions ~20 hours

Theory (first ~3 weeks)

– Parallel architectures

– Parallel programming patterns

– Performance evaluation of parallel programs

Parallel programming (rest of the course)

– Shared-memory programming with C/OpenMP

– Distributed-memory programming with C/MPI

– GPU programming with CUDA/C

– SIMD programming (if there is enough time)

(9)

High Performance Computing 9

Lab sessions

Hands-on programming exercises

We will work under Linux only

Why? Essentially all of the TOP500 supercomputers run Linux

Source: http://www.top500.org (updated June 2021)

(10)

High Performance Computing 10

Hardware resources

isi-raptor03.csr.unibo.it

– Dual socket Xeon, 12 cores, 64 GB RAM, Ubuntu 16.04

– 3x NVidia GeForce GTX 1070

(11)

High Performance Computing 11

Exam

Written exam (weight: 40%)

– Questions/simple exercises on all topics discussed during the course (sample exams are available on the Web page)

– 6 dates: 2 in the winter term (Jan/Feb 2022); 3 in the summer term (Jun/Jul 2022); 1 in the fall term (Sep 2022)

– You can refuse the grade and retry the written exam

Individual programming project + written report (weight: 60%)

– Specification defined by the instructor; one project for each academic year

– The project can be handed in at any time before September 30, 2022

– There is no oral discussion of the project, unless I request it

– If you refuse the grade, you must hand in a NEW project on NEW specifications

Final grade rounded to the nearest integer

– The written exam and the programming project are independent and can be completed in any order

Grades are valid until September 30, 2022

– After that, a new academic year starts

– There is no guarantee that the instructor and/or the type of exam will remain the same

(12)

High Performance Computing 12

Grading the programming project

Correctness

Clarity

Efficiency

Quality of the written report

– Proper grammar, syntax, ...

– Technical correctness

– Performance evaluation

(13)

High Performance Computing 13

No cheating

(14)

High Performance Computing 14

Questions?

(15)

High Performance Computing 15

Intro to parallel programming

(16)

High Performance Computing 16

High Performance Computing

Many applications need considerable computing power

– Weather forecast, climate modeling, physics simulation, product engineering, 3D animation, finance, ...

Why?

– To solve more complex problems

– To solve the same problem in less time

– To solve the same problem more accurately

– To make better use of available computing resources

(17)

Intro to Parallel Programming 17

Applications:

Numerical Wind Tunnel

Source: http://ecomodder.com/forum/showthread.php/random-wind-tunnel-smoke-pictures-thread-26678-12.html

(18)

Intro to Parallel Programming 18

Applications:

Molecular dynamics

Source: http://bionano.physics.illinois.edu/node/109

(19)

Intro to Parallel Programming 19

Applications: Global Climate models

Source: Chapter 1 of Climate Change 2013: The Physical Science Basis https://www.ipcc.ch/report/ar5/wg1/

(20)

Intro to Parallel Programming 20

Applications:

Cosmological Simulation

Bolshoi simulation

https://www.youtube.com/watch?v=gmcvFmlkYjY

The Bolshoi Simulation recreates the large-scale structure of the universe;

it required 6 million CPU hours on NASA's Pleiades Supercomputer

Source: https://www.nas.nasa.gov/hecc/resources/pleiades.html

(21)

Intro to Parallel Programming 21

Moore's Law

"The number of transistors on an IC doubles every 24 months"

That used to mean that every new generation of processors was based on smaller transistors

Moore, G. E., "Cramming more components onto integrated circuits", Electronics, 38(8), April 1965

Gordon E. Moore (1929–)

(22)

Intro to Parallel Programming 22

Physics lesson

Smaller transistors → Faster processor

Faster processor → Higher power consumption

Higher power consumption → More heat produced

More heat produced → Unreliable processor

(23)

Intro to Parallel Programming 23

Power

The power required by an IC (e.g., a processor) can be expressed as

Power = C × V² × f

where:

– C is the capacitance (the ability of a circuit to store energy)

– V is the voltage

– f is the frequency at which the processor operates

(24)

Intro to Parallel Programming 24

Power

Figure (credits: Tim Mattson): a single processor running at frequency f, vs. two processors that split the same input/output and each run at frequency f/2

Single processor: capacitance C, voltage V, frequency f → Power = C × V² × f

Two processors at half frequency: capacitance 2.2 C, voltage 0.6 V, frequency 0.5 f → Power = 2.2 × (0.6)² × 0.5 × C × V² × f = 0.396 × C × V² × f

(25)

Intro to Parallel Programming 25

Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/

(26)

High Performance Computing 26

Processor/Memory wall

Source: John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, Fifth Ed., Morgan Kaufmann, 2012, ISBN: 978-0-12-383872-8

(27)

Intro to Parallel Programming 27

What happens today?

HW designers create processors with more cores

Result:

– parallel hardware is ubiquitous

– parallel software is rare

The challenge

– Make parallel software as common as parallel hardware

NVidia Tegra 4 SoC

(28)

High Performance Computing 28

Parallel programming in brief

Decompose the problem into sub-problems

Distribute sub-problems to the available execution units

Solve sub-problems independently

– Cooperate to solve sub-problems

Goals

– Reduce the wall-clock time

– Balance the workload across execution units

– Reduce communication and synchronization overhead

Credits: S. Orlando

(29)

High Performance Computing 29

Concurrency vs Parallelism

Credits: Tim Mattson, Intel

Figure: three tasks (Task 1, Task 2, Task 3) either interleaved on a single execution unit (concurrency without parallelism) or executed at the same time on different execution units (parallelism)

(30)

High Performance Computing 30

The “Holy Grail”

Write serial code and have a “smart” compiler capable of parallelizing programs automatically

It has been done in some very specific cases

– In practice, no compiler proved to be “smart” enough

Writing efficient parallel code requires that the programmer understands, and makes explicit use of, the underlying hardware

"Here we come"

(31)

High Performance Computing 31

Parallel programming is difficult

Serial version: ~49 lines of C++ code

Parallel version: ~1000 lines of C/C++ code

http://www.moreno.marzolla.name/software/svmcell/

(32)

Intro to Parallel Programming 33

Example

Sum-reduction of an array

("Hello, world!" of parallel programming)

(33)

Intro to Parallel Programming 34

Sum-Reduction

We start assuming a shared-memory architecture

– All execution units share a common memory space

We begin with a sequential solution and parallelize it

– This is not always a good idea; some parallel algorithms have nothing in common with their sequential counterparts!

– However, it is sometimes a reasonable starting point

Credits: Mary Hall

(34)

Intro to Parallel Programming 35

Sequential algorithm

Compute the sum of the content of an array A of length n

float seq_sum(float* A, int n)
{
    int i;
    float sum = 0.0;
    for (i = 0; i < n; i++) {
        sum += A[i];
    }
    return sum;
}

Credits: Mary Hall

(35)

Intro to Parallel Programming 36

Version 1 (Wrong!)

Assuming P execution units (e.g., processors), each one computes a partial sum of n / P adjacent elements

Example: n = 15, P = 3

my_block_len = n / P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
for (my_i = my_start; my_i < my_end; my_i++) {
    sum += v[my_i];
}

Variables whose names start with my_ are assumed to be local (private) to each processor; all other variables are assumed to be global (shared)

Figure: the 15 elements of v[] are split into three blocks of adjacent elements, assigned to Proc 0, Proc 1 and Proc 2

WRONG: race condition on the shared variable sum!
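As a concrete illustration of this race condition (not part of the original slides), here is a minimal C/OpenMP sketch; OpenMP is covered later in the course. All threads update a shared accumulator without synchronization, so the printed result is usually smaller than the expected value:

#include <stdio.h>

#define N 1000000

int main(void)
{
    static float v[N];        /* static: keeps the large array off the stack */
    float sum = 0.0;          /* shared accumulator */
    for (int i = 0; i < N; i++) v[i] = 1.0;

    /* The iterations are split among the threads, but every thread
       updates the shared variable sum with no synchronization:
       the same race condition as in the pseudocode above. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        sum += v[i];
    }

    printf("sum = %f (expected %d)\n", sum, N);
    return 0;
}

Compiled with OpenMP enabled (e.g., gcc -fopenmp) and run with several threads, this typically prints a value well below 1000000, and a different one at every run.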

(36)

Intro to Parallel Programming 37

Version 1 (better, but still wrong)

Assuming P processors, each one computes a partial sum of n / P adjacent elements

Example: n = 15, P = 3

my_block_len = n / P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; /* Suppose this is executed exactly once */
mutex m;
for (my_i = my_start; my_i < my_end; my_i++) {
    mutex_lock(&m);
    sum += v[my_i];
    mutex_unlock(&m);
}

Figure: the 15 elements of v[] are again split evenly among Proc 0, Proc 1 and Proc 2

WRONG

(37)

Intro to Parallel Programming 38

Version 1 (better, but still wrong)

Assuming P processors, each one computes a partial sum of n / P adjacent elements

Example: n = 17, P = 3

my_block_len = n / P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; /* Suppose this is executed exactly once */
mutex m;
for (my_i = my_start; my_i < my_end; my_i++) {
    mutex_lock(&m);
    sum += v[my_i];
    mutex_unlock(&m);
}

Figure: with n = 17 and P = 3 each processor gets n/P = 5 elements, so the last two elements of v[] ("??") are assigned to nobody

WRONG: when P does not divide n, some elements are never summed

(38)

Intro to Parallel Programming 39

Version 1 (correct, but not efficient)

Assuming P processors, each one computes a partial sum of n / P adjacent elements

Example: n = 17, P = 3

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; /* Suppose this is executed exactly once */
mutex m;
for (my_i = my_start; my_i < my_end; my_i++) {
    mutex_lock(&m);
    sum += v[my_i];
    mutex_unlock(&m);
}

Figure: with the new index computation the three blocks (Proc 0, Proc 1, Proc 2) cover the whole array v[]
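To see why the new index formula fixes the problem, here is a small stand-alone C sketch (added for illustration, not part of the slides) that prints the ranges produced for n = 17 and P = 3:

#include <stdio.h>

int main(void)
{
    const int n = 17, P = 3;
    /* The same start/end formula used above: the blocks differ in size
       by at most one element and always cover the whole array. */
    for (int my_id = 0; my_id < P; my_id++) {
        const int my_start = n * my_id / P;
        const int my_end   = n * (my_id + 1) / P;
        printf("Proc %d: [%d, %d) -> %d elements\n",
               my_id, my_start, my_end, my_end - my_start);
    }
    return 0;
}

The output is [0, 5), [5, 11), [11, 17): 5 + 6 + 6 = 17 elements, so no element is skipped.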

(39)

Intro to Parallel Programming 40

Version 2

Too much contention on mutex m

Each processor acquires and releases the mutex for each element of the array!

Solution: increase the mutex granularity

Each processor accumulates the partial sum on a local (private) variable

The mutex is used at the end to update the global sum

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; my_sum = 0.0; mutex m;
for (my_i = my_start; my_i < my_end; my_i++) {
    my_sum += v[my_i];
}
mutex_lock(&m);
sum += my_sum;
mutex_unlock(&m);
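For reference, here is roughly how the same pattern looks in C/OpenMP (covered later in the course); this is an illustrative sketch, not the notation used in the slides. Each thread accumulates a private my_sum, and the shared sum is updated once per thread inside a critical section:

float parallel_sum(const float *v, int n)
{
    float sum = 0.0;                 /* shared global sum */
    #pragma omp parallel
    {
        float my_sum = 0.0;          /* private partial sum of this thread */
        #pragma omp for
        for (int i = 0; i < n; i++) {
            my_sum += v[i];          /* no locking inside the loop */
        }
        #pragma omp critical         /* one update of sum per thread */
        sum += my_sum;
    }
    return sum;
}

In practice OpenMP also provides the reduction(+:sum) clause, which implements this pattern (and more efficient variants) automatically.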

(40)

Intro to Parallel Programming 41

Version 3: Remove the mutex

(wrong, in a subtle way)

We use a shared array psum[] where each processor can store its local sum

At the end, one processor computes the global sum

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0; /* all elements set to 0.0 */
for (my_i = my_start; my_i < my_end; my_i++) {
    psum[my_id] += v[my_i];
}
if ( 0 == my_id ) { /* only the master executes this */
    sum = 0.0;
    for (my_i = 0; my_i < P; my_i++)
        sum += psum[my_i];
}

WRONG

(41)

Intro to Parallel Programming 42

The problem with version 3

Processor 0 could start the computation of the global sum before all other processors have computed the local sums!

Figure: timeline of P0–P3; P0 may enter the "compute global sum" phase while P1–P3 are still in the "compute local sums" phase

(42)

Intro to Parallel Programming 43

Version 4 (correct)

Use a barrier synchronization

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0;
for (my_i = my_start; my_i < my_end; my_i++) {
    psum[my_id] += v[my_i];
}
barrier();
if ( 0 == my_id ) {
    sum = 0.0;
    for (my_i = 0; my_i < P; my_i++)
        sum += psum[my_i];
}

Figure: timeline of P0–P3; the barrier() forces every processor to finish the "compute local sums" phase before P0 starts the "compute global sum" phase
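The same pattern can be rendered in C/OpenMP (an illustrative sketch, not the course's pseudocode notation): omp_get_thread_num() and omp_get_max_threads() play the role of my_id and P, and #pragma omp barrier plays the role of barrier():

#include <stdlib.h>
#include <omp.h>

float parallel_sum_psum(const float *v, int n)
{
    const int P = omp_get_max_threads();
    float *psum = calloc(P, sizeof(*psum));  /* shared array of partial sums */
    float sum = 0.0;

    #pragma omp parallel num_threads(P)
    {
        const int my_id = omp_get_thread_num();
        const int my_start = n * my_id / P;
        const int my_end   = n * (my_id + 1) / P;
        for (int i = my_start; i < my_end; i++) {
            psum[my_id] += v[i];
        }
        #pragma omp barrier          /* wait until all partial sums are ready */
        if (0 == my_id) {            /* the master computes the global sum */
            for (int i = 0; i < P; i++) sum += psum[i];
        }
    }
    free(psum);
    return sum;
}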

(43)

Intro to Parallel Programming 44

Version 5

Distributed-memory version

P << n processors

All variables are private

Each processor computes a local sum

Each processor sends the local sum to processor 0 (the master)

...
my_sum = 0.0;
my_start = …; my_end = …;
for ( i = my_start; i < my_end; i++ ) {
    my_sum += get_value(i);
}
if ( 0 == my_id ) {
    for ( i = 1; i < P; i++ ) {
        tmp = receive from proc i;
        my_sum += tmp;
    }
    printf("The sum is %f\n", my_sum);
} else {
    send my_sum to thread 0;
}
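To show how this pseudocode maps onto real distributed-memory code, here is a minimal C/MPI sketch (MPI is introduced later in the course; the problem size and the stand-in for get_value() are assumptions made for this example):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int my_id, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    const int n = 1000000;                  /* assumed problem size */
    const int my_start = n * my_id / P;
    const int my_end   = n * (my_id + 1) / P;

    float my_sum = 0.0;
    for (int i = my_start; i < my_end; i++) {
        my_sum += 1.0;                      /* stand-in for get_value(i) */
    }

    if (0 == my_id) {
        for (int i = 1; i < P; i++) {       /* receive the P-1 partial sums */
            float tmp;
            MPI_Recv(&tmp, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            my_sum += tmp;
        }
        printf("The sum is %f\n", my_sum);
    } else {
        MPI_Send(&my_sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Note that the receive loop serializes the P - 1 messages at processor 0, which is exactly the bottleneck discussed in the next slide.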

(44)

Intro to Parallel Programming 45

Version 5

Figure: the array A[ ] = {1, 3, -2, 7, -6, 5, 3, 4} is distributed over Proc 0 … Proc 7 (one element per processor, stored in my_sum); Proc 0 receives the partial sums from Proc 1 … Proc 7 one at a time and accumulates them (running totals 4, 2, 9, 3, 8, 11, final result 15). The serial accumulation at Proc 0 is a bottleneck.

(45)

Intro to Parallel Programming 46

Parallel reduction

Figure: tree-structured reduction of A[ ] = {1, 3, -2, 7, -6, 5, 3, 4} over Proc 0 … Proc 7: pairs of processors combine their values (4, 5, -1, 7), then pairs of those results (9, 6), until Proc 0 holds the final result 15

(P – 1) sums are still performed; however, processor 0 receives ~log2(P) messages and performs ~log2(P) sums
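In MPI the programmer does not usually code this tree by hand: the collective operation MPI_Reduce combines the partial sums and delivers the result to one process, and implementations are free to use a tree-structured algorithm internally. A minimal sketch:

#include <mpi.h>

/* Combine the local partial sums of all processes into a global sum
   delivered to rank 0; MPI_SUM is the predefined sum operation. */
float global_sum(float my_sum)
{
    float sum = 0.0;    /* significant only on rank 0 */
    MPI_Reduce(&my_sum, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    return sum;
}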

(46)

High Performance Computing 47

Task parallelism vs Data parallelism

Task Parallelism

– Distribute (possibly different) tasks to processors

Data Parallelism

– Distribute data to processors

– Each processor executes the same task on different data

(47)

High Performance Computing 48

Example

We have a table containing hourly temperatures at some location

– 24 columns, 365 rows

Compute the minimum, maximum and average temperature for each day

Assume we have 3 independent processors

(48)

High Performance Computing 49

Example

Figure: a 365 × 24 table (days 0–364 as rows, hours 0–23 as columns), with three result columns max, min and ave, one value per day

(49)

High Performance Computing 50

Data parallel approach

Figure: the rows (days) of the table are partitioned into three blocks; Proc 0, Proc 1 and Proc 2 each compute max, min and ave for the days in their own block
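A C/OpenMP sketch of the data-parallel approach (an illustration; the array names and layout are assumptions, and OpenMP is covered later in the course): the days are distributed among the threads, and every thread executes the same per-day computation on its own rows.

#include <float.h>

#define DAYS  365
#define HOURS 24

void daily_stats(const float t[DAYS][HOURS],
                 float tmax[DAYS], float tmin[DAYS], float tave[DAYS])
{
    /* Data parallelism: the iterations (days) are divided among the
       threads; each thread applies the same code to different rows. */
    #pragma omp parallel for
    for (int d = 0; d < DAYS; d++) {
        float mx = -FLT_MAX, mn = FLT_MAX, sum = 0.0;
        for (int h = 0; h < HOURS; h++) {
            if (t[d][h] > mx) mx = t[d][h];
            if (t[d][h] < mn) mn = t[d][h];
            sum += t[d][h];
        }
        tmax[d] = mx;
        tmin[d] = mn;
        tave[d] = sum / HOURS;
    }
}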

(50)

High Performance Computing 51

Task parallel approach

Figure: the same 365 × 24 table, but now Proc 0 computes the max column, Proc 1 the min column and Proc 2 the ave column, each scanning all the days
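For contrast, a task-parallel sketch of the same computation using OpenMP sections (again an illustrative assumption, not material from the slides): each processor executes a different task (max, min or ave) over all the days.

#include <float.h>

#define DAYS  365
#define HOURS 24

void daily_stats_taskpar(const float t[DAYS][HOURS],
                         float tmax[DAYS], float tmin[DAYS], float tave[DAYS])
{
    /* Task parallelism: three different tasks, one per section;
       each task scans the whole table. */
    #pragma omp parallel sections
    {
        #pragma omp section            /* task 1: daily maximum */
        for (int d = 0; d < DAYS; d++) {
            float mx = -FLT_MAX;
            for (int h = 0; h < HOURS; h++)
                if (t[d][h] > mx) mx = t[d][h];
            tmax[d] = mx;
        }
        #pragma omp section            /* task 2: daily minimum */
        for (int d = 0; d < DAYS; d++) {
            float mn = FLT_MAX;
            for (int h = 0; h < HOURS; h++)
                if (t[d][h] < mn) mn = t[d][h];
            tmin[d] = mn;
        }
        #pragma omp section            /* task 3: daily average */
        for (int d = 0; d < DAYS; d++) {
            float sum = 0.0;
            for (int h = 0; h < HOURS; h++)
                sum += t[d][h];
            tave[d] = sum / HOURS;
        }
    }
}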

(51)

High Performance Computing 52

Key concepts

Parallel architectures “naturally” derive from physics laws

Parallel architectures require parallel programming paradigms

Writing parallel programs is harder than writing sequential programs
