High Performance Computing
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna
[email protected]
Pacheco, Chapter 1
High Performance Computing 2
Copyright © 2013—2021
Moreno Marzolla, Università di Bologna, Italy
https://www.moreno.marzolla.name/teaching/HPC/
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit
https://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons,
543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
High Performance Computing 3
Credits
●
prof. Salvatore Orlando, Univ. Ca' Foscari di Venezia http://www.dsi.unive.it/~orlando/
●
prof. Mary Hall, University of Utah https://www.cs.utah.edu/~mhall/
●
Tim Mattson, Intel
High Performance Computing 4
Who I am
●
Moreno Marzolla
– Associate professor @ DISI
– https://www.moreno.marzolla.name/
●
Current and past teaching activity
– High Performance Computing
– Lab Algorithms and Data Structures
– Past: Algorithms and Data Structures; Complex Systems;
Software Engineering; Introduction to Programming
●
Research activity
– Parallel programming
– Modeling and simulation
High Performance Computing 5
High Performance Computing
●
Web page
– https://www.moreno.marzolla.name/teaching/HPC/
●
Schedule
– Monday 09:00–12:00 room 2.4
– Tuesday 15:00–18:00 lab 2.2 (First week only)
– Wednesday 15:00–18:00 lab 2.2 (From the 2nd week)
– Check the Web page and the official timetable for last-minute changes
●
Office hours
– Online (Teams); please send e-mail
High Performance Computing 6
References
●
Peter S. Pacheco, An Introduction to Parallel Programming, Morgan
Kaufmann 2011, ISBN 9780123742605
– Theory + OpenMP/MPI programming
●
CUDA C programming guide
http://docs.nvidia.com/cuda/cuda-c-programming-guide/
– CUDA/C programming
●
See the Web page for slides and links to online material
https://www.moreno.marzolla.name/
High Performance Computing 7
Prerequisites
High Performance Computing builds on:
– Programming
– Algorithms and Data Structures
– Computer Architectures
– Operating Systems
– …
High Performance Computing 8
Syllabus
●
6 CFU (~ 60 hours of lectures/lab)
– Lectures ~40 hours
– Lab sessions ~20 hours
●
Theory (first ~3 weeks)
– Parallel architectures
– Parallel programming patterns
– Performance evaluation of parallel programs
●
Parallel programming (rest of the course)
– Shared-memory programming with C/OpenMP
– Distributed-memory programming with C/MPI
– GPU programming with CUDA/C
– SIMD programming (if there is enough time)
High Performance Computing 9
Lab sessions
●
Hands-on programming exercises
●
We will work under Linux only
●
Why? All systems in the TOP500 list of the world's fastest supercomputers run Linux
(List updated June 2021; source: http://www.top500.org)
High Performance Computing 10
Hardware resources
●
isi-raptor03.csr.unibo.it
– Dual socket Xeon, 12 cores, 64 GB RAM, Ubuntu 16.04
– 3x NVidia GeForce GTX 1070
High Performance Computing 11
Exam
●
Written exam (weight: 40%)
– Questions/simple exercises on all topics discussed during the course (sample exams are available on the Web page)
– 6 dates: 2 in the winter term (Jan/Feb 2022); 3 in the summer term (Jun/Jul 2022); 1 in the fall term (Sep 2022)
– You can refuse the grade and retry the written exam
●
Individual programming project + written report (weight: 60%)
– Specification defined by the instructor; one project for each academic year
– The project can be handed in at any time before September 30, 2022
– There is no oral discussion, unless I request it
– If you refuse the grade, you must hand in a NEW project on NEW specifications
●
Final grade rounded to the nearest integer
●
The written exam and programming project are independent, and can be completed in any order
●
Grades are valid until September 30, 2022
– After that, a new academic year starts
– There is no guarantee that the instructor and/or the type of exam will remain the same
High Performance Computing 12
Grading the programming project
●
Correctness
●
Clarity
●
Efficiency
●
Quality of the written report
– Proper grammar, syntax, ...
– Technical correctness
– Performance evaluation
High Performance Computing 13
No cheating
High Performance Computing 14
Questions?
High Performance Computing 15
Intro to parallel programming
High Performance Computing 16
High Performance Computing
●
Many applications need considerable computing power
– Weather forecast, climate modeling, physics simulation, product engineering, 3D animation, finance, ...
●
Why?
– To solve more complex problems
– To solve the same problem in less time
– To solve the same problem more accurately
– To make better use of available computing resources
Intro to Parallel Programming 17
Applications:
Numerical Wind Tunnel
Source: http://ecomodder.com/forum/showthread.php/random-wind-tunnel-smoke-pictures-thread-26678-12.html
Intro to Parallel Programming 18
Applications:
Molecular dynamics
Source: http://bionano.physics.illinois.edu/node/109
Intro to Parallel Programming 19
Applications: Global Climate models
Source: Chapter 1 of Climate Change 2013: The Physical Science Basis https://www.ipcc.ch/report/ar5/wg1/
Intro to Parallel Programming 20
Applications:
Cosmological Simulation
Bolshoi simulation
https://www.youtube.com/watch?v=gmcvFmlkYjY
The Bolshoi Simulation recreates the large-scale structure of the universe;
it required 6 million CPU hours on NASA's Pleiades Supercomputer
Source: https://www.nas.nasa.gov/hecc/resources/pleiades.html
Intro to Parallel Programming 21
Moore's Law
"The number of transistors on an IC doubles every 24 months"
●
That used to mean that every new generation of processors was based on smaller transistors
Moore, G.E., "Cramming more components onto integrated circuits", Electronics, 38(8), April 1965
Gordon E. Moore (1929– )
Intro to Parallel Programming 22
Physics lesson
●
Smaller transistors → Faster processor
●
Faster processor → Higher power consumption
●
Higher power consumption → More heat produced
●
More heat produced → Unreliable processor
Intro to Parallel Programming 23
Power
●
The power required by an IC (e.g., a processor) can be expressed as
Power = C × V² × f, where:
– C is the capacitance (ability of a circuit to store energy)
– V is the voltage
– f is the frequency at which the processor operates
Intro to Parallel Programming 24
Power
[Figure: a single processor working at frequency f, with capacitance C and voltage V, consumes Power = C V² f; replacing it with two processors running at frequency f/2 each (total capacitance 2.2 C, voltage 0.6 V) gives Power = 0.396 C V² f]
Credits: Tim Mattson
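Checking the arithmetic of the figure: the two slower processors pay a capacitance penalty but gain on voltage and frequency, since

2.2 × (0.6)² × 0.5 = 2.2 × 0.36 × 0.5 = 0.396

so, assuming the workload can be split across the two processors, the same throughput is delivered at roughly 40% of the original power.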
Intro to Parallel Programming 25
Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/
High Performance Computing 26
Processor/Memory wall
Source: John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, Fifth Ed., Morgan Kaufmann 2012, ISBN 978-0-12-383872-8
Intro to Parallel Programming 27
What happens today?
●
HW designers create processors with more cores
●
Result:
– parallel hardware is ubiquitous
– parallel software is rare
●
The challenge
– Make parallel software as common as parallel hardware
NVidia Tegra 4 SoC
High Performance Computing 28
Parallel programming in brief
●
Decompose the problem into sub-problems
●
Distribute sub-problems to the available execution units
●
Solve sub-problems independently
– Cooperate to solve sub-problems
●
Goals
– Reduce the wall-clock time
– Balance the workload across execution units
– Reduce communication and synchronization overhead
Credits: S. Orlando
High Performance Computing 29
Concurrency vs Parallelism
Credits: Tim Mattson, Intel
[Figure: Tasks 1–3 interleaved on a single execution unit (concurrency without parallelism) vs. executed simultaneously on different units (parallelism)]
High Performance Computing 30
The “Holy Grail”
●
Write serial code and have a “smart” compiler capable of parallelizing programs automatically
●
It has been done in some very specific cases
– In practice, no compiler proved to be “smart” enough
●
Writing efficient parallel code requires that the programmer understands, and makes explicit use of, the underlying hardware
Here we come
High Performance Computing 31
Parallel programming is difficult
Serial version: ~49 lines of C++ code
Parallel version: ~1000 lines of C/C++ code
http://www.moreno.marzolla.name/software/svmcell/
Intro to Parallel Programming 33
Example
Sum-reduction of an array
("Hello, world!" of parallel programming)
Intro to Parallel Programming 34
Sum-Reduction
●
We start assuming a shared-memory architecture
– All execution units share a common memory space
●
We begin with a sequential solution and parallelize it
– This is not always a good idea; some parallel algorithms have nothing in common with their sequential counterparts!
– However, it is sometimes a reasonable starting point
Credits: Mary Hall
Intro to Parallel Programming 35
Sequential algorithm
●
Compute the sum of the content of an array A of length n
float seq_sum(float* A, int n)
{
    int i;
    float sum = 0.0;
    for (i=0; i<n; i++) {
        sum += A[i];
    }
    return sum;
}
Credits: Mary Hall
Intro to Parallel Programming 36
Version 1 (Wrong!)
●
Assuming P execution units (e.g., processors), each one computes a partial sum of n / P adjacent elements
●
Example: n = 15, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    sum += v[my_i];
}
WRONG: all processors update the shared variable sum concurrently (race condition!)
Variables whose names start with my_ are assumed to be local (private) to each processor; all other variables are assumed to be global (shared)
Intro to Parallel Programming 37
Version 1 (better, but still wrong)
●
Assuming P processors, each one computes a partial sum of n / P adjacent elements
●
Example: n = 15, P = 3
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; /* Suppose this is executed exactly once */
mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    mutex_lock(&m);
    sum += v[my_i];
    mutex_unlock(&m);
}
WRONG
Intro to Parallel Programming 38
Version 1 (better, but still wrong)
●
Assuming P processors, each one computes a partial sum of n / P adjacent elements
●
Example: n = 17, P = 3
[Figure: with n = 17 and P = 3, the block length n/P = 5 leaves the last two elements of v[] assigned to no processor]
my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; /* Suppose this is executed exactly once */
mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    mutex_lock(&m);
    sum += v[my_i];
    mutex_unlock(&m);
}
WRONG
Intro to Parallel Programming 39
Version 1
(correct, but not efficient)
●
Assuming P processors, each one computes a partial sum of n / P adjacent elements
●
Example: n = 17, P = 3
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; /* Suppose this is executed exactly once */
mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    mutex_lock(&m);
    sum += v[my_i];
    mutex_unlock(&m);
}
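To see why the new index formulas handle the case in which P does not divide n, here is a small illustrative C program (not part of the original slides) that prints both partitioning schemes for n = 17 and P = 3:

#include <stdio.h>

int main(void)
{
    const int n = 17, P = 3;
    for (int id = 0; id < P; id++) {
        const int len = n / P;          /* truncated: 17/3 = 5 */
        printf("fixed block : proc %d -> [%2d, %2d)\n",
               id, id * len, id * len + len);
        printf("balanced    : proc %d -> [%2d, %2d)\n",
               id, n * id / P, n * (id + 1) / P);
    }
    /* the fixed block length leaves elements 15 and 16 unassigned;
       the balanced split covers [0,5), [5,11), [11,17), i.e. all 17 elements */
    return 0;
}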
Intro to Parallel Programming 40
Version 2
●
Too much contention on mutex m
– Each processor acquires and releases the mutex for each element of the array!
●
Solution: increase the mutex granularity
– Each processor accumulates the partial sum in a local (private) variable
– The mutex is used at the end to update the global sum
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; my_sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_sum += v[my_i];
}
mutex_lock(&m);
sum += my_sum;
mutex_unlock(&m);
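The pseudocode above can be turned into a complete program; the following POSIX Threads sketch is one possible (illustrative) realization, with NTHREADS, N and the worker/main structure chosen for the example rather than taken from the slides. Compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

float v[N];                              /* shared input array  */
float sum = 0.0f;                        /* shared global sum   */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    const int my_id = *(const int *)arg;
    const int my_start = (long)N * my_id / NTHREADS;
    const int my_end   = (long)N * (my_id + 1) / NTHREADS;
    float my_sum = 0.0f;                 /* private partial sum */
    for (int i = my_start; i < my_end; i++)
        my_sum += v[i];
    pthread_mutex_lock(&m);              /* one lock per thread, not per element */
    sum += my_sum;
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int id[NTHREADS];
    for (int i = 0; i < N; i++) v[i] = 1.0f;
    for (int i = 0; i < NTHREADS; i++) {
        id[i] = i;
        pthread_create(&tid[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("sum = %f\n", sum);           /* expected: 1000000.0 */
    return 0;
}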
Intro to Parallel Programming 41
Version 3: Remove the mutex
(wrong, in a subtle way)
●
We use a shared array psum[] where each processor can store its local sum
●
At the end, one processor computes the global sum
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0; /* all elements set to 0.0 */
for (my_i=my_start; my_i<my_end; my_i++) {
    psum[my_id] += v[my_i];
}
if ( 0 == my_id ) { /* only the master executes this */
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}
WRONG
Intro to Parallel Programming 42
The problem with version 3
●
Processor 0 could start the computation of the global sum before all other processors have computed the local sums!
[Figure: timeline in which P0 starts computing the global sum while P1, P2 and P3 are still computing their local sums]
Intro to Parallel Programming 43
Version 4 (correct)
●
Use a barrier synchronization
my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    psum[my_id] += v[my_i];
}
barrier();
if ( 0 == my_id ) {
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}
[Figure: P0..P3 compute their local sums, all wait at the barrier(), then P0 computes the global sum]
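For comparison, OpenMP (covered later in the course) can express the whole reduction with a single directive: the reduction(+:sum) clause makes the compiler create the private partial sums and combine them safely, so no explicit mutex or barrier appears in the code. A minimal sketch (compile with gcc -fopenmp):

#include <omp.h>

float omp_sum(const float *A, int n)
{
    float sum = 0.0f;
    /* each thread accumulates a private copy of sum; the
       reduction(+:sum) clause combines the copies at the end */
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += A[i];
    }
    return sum;
}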
Intro to Parallel Programming 44
Version 5
Distributed-memory version
●
P << n processors
●
All variables are private
●
Each processor computes a local sum
●
Each processor sends its local sum to processor 0 (the master)
... my_sum = 0.0;
my_start = …; my_end = …;
for ( i = my_start; i < my_end; i++ ) {
    my_sum += get_value(i);
}
if ( 0 == my_id ) {
    for ( i=1; i<P; i++ ) {
        tmp = receive from proc i;
        my_sum += tmp;
    }
    printf("The sum is %f\n", my_sum);
} else {
    send my_sum to proc 0;
}
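A possible MPI rendering of this scheme (MPI is also covered later in the course) is sketched below; the per-process data is just a placeholder, and the send/receive calls mirror the "receive from proc i" / "send to proc 0" pseudocode above:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int my_id, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* each process computes the sum of its own local data
       (here just a placeholder value) */
    float my_sum = 1.0f * my_id;

    if (0 == my_id) {
        /* the master receives P-1 partial sums, one at a time */
        for (int i = 1; i < P; i++) {
            float tmp;
            MPI_Recv(&tmp, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            my_sum += tmp;
        }
        printf("The sum is %f\n", my_sum);
    } else {
        MPI_Send(&my_sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}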
Intro to Parallel Programming 45
Version 5
[Figure: the values 1, 3, -2, 7, -6, 5, 3, 4 are held by Proc 0..Proc 7; Proc 0 receives the partial sums one at a time and accumulates them sequentially into my_sum, ending with the total 15. Proc 0 is the bottleneck]
Intro to Parallel Programming 46
Parallel reduction
[Figure: tree-structured reduction of the values 1, 3, -2, 7, -6, 5, 3, 4 held by Proc 0..Proc 7: the pairwise sums are 4, 5, -1, 7; the next level gives 9 and 6; Proc 0 obtains the final result 15]
●
(P – 1) sums are still performed; however, processor 0 receives ~log2(P) messages and performs ~log2(P) sums
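Message-passing libraries already provide this tree-structured combination as a collective operation. Continuing the MPI sketch shown earlier, the whole receive loop can be replaced by a single call, which is typically implemented internally with a tree-like communication pattern:

/* assumes MPI has been initialized and my_sum holds this
   process' partial sum, as in the previous sketch */
float sum = 0.0f;
MPI_Reduce(&my_sum, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
/* after the call, only rank 0 holds the valid global sum */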
High Performance Computing 47
Task parallelism vs Data parallelism
●
Task Parallelism
– Distribute (possibly different) tasks to processors
●
Data Parallelism
– Distribute data to processors
– Each processor executes the same task on different data
High Performance Computing 48
Example
●
We have a table containing hourly temperatures at some location
– 24 columns, 365 rows
●
Compute the minimum, maximum and average temperature for each day
●
Assume we have 3 independent processors
High Performance Computing 49
Example
[Figure: a table with days 0–364 as rows and hours 0–23 as columns; for each day (row) the three values max, min and ave must be computed]
High Performance Computing 50
Data parallel approach
[Figure: the rows (days) of the table are partitioned into three contiguous blocks; Proc 0, Proc 1 and Proc 2 each compute max, min and ave for the days in their own block]
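A data-parallel sketch of this computation in C/OpenMP (array names and layout below are illustrative, not from the slides): the rows (days) are distributed among the threads, and every thread executes the same min/max/average loop on its own share of the days. Compile with gcc -fopenmp; without OpenMP the code simply runs serially.

#include <float.h>

#define DAYS  365
#define HOURS 24

/* temp[d][h] = temperature measured at hour h of day d */
void daily_stats(const float temp[DAYS][HOURS],
                 float tmin[DAYS], float tmax[DAYS], float tave[DAYS])
{
    /* data parallelism: the iterations over the days are split
       among the threads; each thread runs the same code on
       different rows of the table */
#pragma omp parallel for
    for (int d = 0; d < DAYS; d++) {
        float lo = FLT_MAX, hi = -FLT_MAX, sum = 0.0f;
        for (int h = 0; h < HOURS; h++) {
            const float t = temp[d][h];
            if (t < lo) lo = t;
            if (t > hi) hi = t;
            sum += t;
        }
        tmin[d] = lo;
        tmax[d] = hi;
        tave[d] = sum / HOURS;
    }
}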
High Performance Computing 51
Task parallel approach
[Figure: each processor computes one of the three results over all the days: Proc 0 computes max, Proc 1 computes min, Proc 2 computes ave]
High Performance Computing 52
Key concepts
●
Parallel architectures “naturally” derive from physics laws
●
Parallel architectures require parallel programming paradigms
●