
Architectures and Platforms for AI module 2

Academic year: 2021


(1)

Architectures and Platforms for AI module 2

Moreno Marzolla

Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna

http://www.moreno.marzolla.name/

Pacheco, Chapter 1

(2)

High Performance Computing 2

(3)

High Performance Computing 3

Credits

prof. Salvatore Orlando, Univ. Ca' Foscari di Venezia http://www.dsi.unive.it/~orlando/

prof. Mary Hall, University of Utah https://www.cs.utah.edu/~mhall/

Tim Mattson, Intel

(4)

High Performance Computing 4

Who I am

Moreno Marzolla

Associate professor @ DISI (Cesena campus)

https://www.moreno.marzolla.name/

Recent teaching activity

High Performance Computing @ ISI

Foundations of Computer Science

Algorithms and Data Structures

Research activity

Parallel programming

Modeling and simulation

(5)

High Performance Computing 5

Architectures and Platforms for AI module 2

Web page

https://www.moreno.marzolla.name/teaching/APAI

Schedule

Monday 11:00—13:30 Room 0.6

Check the Web page and the official course timetable for variations

Office hours

On demand via Teams; please send e-mail

(6)

High Performance Computing 6

Syllabus

2 CFU (~ 24 hours of lectures)

Theory

Parallel programming patterns

Performance evaluation of parallel programs

Parallel programming practice

Shared-memory programming with C/OpenMP

GPU programming with CUDA/C

(7)

High Performance Computing 7

References

Peter S. Pacheco, An Introduction to Parallel Programming, Morgan Kaufmann 2011, ISBN 9780123742605

Theory + C/OpenMP programming

CUDA C programming guide

http://docs.nvidia.com/cuda/cuda-c-programming-guide/

CUDA/C programming

See the Web page for slides and links to online material

https://www.moreno.marzolla.name/

(8)

High Performance Computing 8

Prerequisites

Mandatory

You must be good at C programming

Strongly suggested

(Foundations of) Operating Systems

(Foundations of) Computer Architectures

(Foundations of) Algorithms and Data Structures

(9)

High Performance Computing 9

Hardware resources

isi-raptor03.csr.unibo.it

Dual socket Xeon, 12 cores, 64 GB RAM, Ubuntu 16.04

3x NVidia GeForce GTX 1070

(10)

High Performance Computing 10

Exam

Individual programming project + written report

Specification defined by the instructor

One project for the whole academic year

The project can be handed in at any time

There is no discussion, unless I need explanations

Final grade

Weighted average of modules 1+3 (weight 4) and module 2 (weight 2)

Rounded to the nearest integer

Honors (“lode”) assigned by the instructors for outstanding work

Grades remain valid until Sep 30, 2021

After that, a new edition of this course starts

(11)

High Performance Computing 11

Grading the programming project

Correctness

Clarity

Efficiency

Quality of the written report

Proper grammar, syntax, ...

Technical correctness

Performance evaluation

(12)

High Performance Computing 12

Questions?

(13)

High Performance Computing 13

Intro to parallel programming

(14)

High Performance Computing 14

High Performance Computing

Many applications need considerable computing power

Weather forecast, climate modeling, physics simulation, product engineering, 3D animation, finance, ...

Why?

To solve more complex problems

To solve the same problem in less time

To solve the same problem more accurately

To make better use of available computing resources

(15)

High Performance Computing 15

Parallel programming

“Traditional” scientific paradigm

Make a theory, then experiment

“Traditional” engineering paradigm

Design, then build

Enter numeric experimentation and prototyping

Some phenomena are too complex to be modeled accurately (e.g., weather forecast)

Some experiments are too complex, or costly, or dangerous, or impossible to do in the lab (e.g., wind tunnels, seismic simulations, stellar dynamics...)

Computational science

Numerical simulations are becoming a new way to “do science”

Slide credits: S. Orlando

(16)

High Performance Computing 16

Applications:

Neural Networks

(17)

Applications:

Molecular dynamics

(18)

Intro to Parallel Programming 18

Applications:

Cosmological Simulation

Bolshoi simulation https://vimeo.com/29769051

The Bolshoi Simulation recreates the large-scale structure of the universe; it required 6 million CPU hours on NASA's Pleiades Supercomputer.

Source: https://www.nas.nasa.gov/hecc/resources/pleiades.html

(19)

Intro to Parallel Programming 19

Moore's Law

"The number of transistors on an IC doubles every 24 months"

That used to mean that every new generation of processors was based on smaller transistors

Moore, G.E., Cramming more components onto integrated circuits. Electronics, 38(8), April 1965

Gordon E. Moore (1929– )

(20)

Intro to Parallel Programming 20

Figure: transistor count of microprocessors over time (log scale). Image by Wgsimon, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542

(21)

Intro to Parallel Programming 21

Physics lesson

Smaller transistors → Faster processor

Faster processor → Higher power consumption

Higher power consumption → More heat produced

More heat produced → Unreliable processor

(22)

Intro to Parallel Programming 22

Power

The power required by an IC (e.g., a processor) can be expressed as

Power = C × V² × f

where:

C is the capacitance (ability of a circuit to store energy)

V is the voltage

f is the frequency at which the processor operates

(23)

Intro to Parallel Programming 23

Power

Figure (credits: Tim Mattson): a single processor handling an input stream at frequency f, compared with two processors each handling half of the input at frequency f/2.

Single processor: capacitance C, voltage V, frequency f → Power = C V² f

Dual processor: capacitance 2.2 C, voltage 0.6 V, frequency 0.5 f → Power = 2.2 C × (0.6 V)² × (0.5 f) ≈ 0.396 C V² f
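(Checking the arithmetic behind the figure: 2.2 × 0.6² × 0.5 = 2.2 × 0.36 × 0.5 = 0.396, so the two slower processors provide the same aggregate clock rate f as the single fast one while drawing roughly 40% of its power.)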

(24)

Intro to Parallel Programming 24

Figure: 42 years of microprocessor trend data. Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/

(25)

High Performance Computing 25

Processor/Memory wall

Source: John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, Fifth Ed., Morgan Kaufmann 2012, ISBN 978-0-12-383872-8

(26)

High Performance Computing 26

Limits

There are limits to “automatic” improvement of scalar performance:

The Power Wall: Clock frequency cannot be increased without exceeding the limits of air cooling

The Memory Wall: Access to data is a limiting factor

The ILP Wall: All the existing instruction-level parallelism (ILP) is already being used

Conclusion:

Explicit parallel mechanisms and explicit parallel programming are required for performance scaling

Slide credits: Hebenstreit, Reinders, Robison, McCool, SC13 tutorial

(27)

Intro to Parallel Programming 27

What happens today?

HW designers create processors with more cores

Result:

parallel hardware is ubiquitous

parallel software is rare

The challenge

Make parallel software as common as parallel hardware

Figure: NVidia Tegra 4 SoC

(28)

High Performance Computing 28

Parallel programming in brief

Decompose the problem into sub-problems

Distribute sub-problems to the available execution units

Solve sub-problems independently

Cooperate to solve sub-problems

Goals

Reduce the wall-clock time

Balance the workload across execution units

Reduce communication and synchronization overhead

Slide credits: S. Orlando

(29)

High Performance Computing 29

Concurrency vs Parallelism

Slide credits: Tim Mattson, Intel

Figure: three tasks (Task 1, Task 2, Task 3) interleaved on a single execution unit (concurrency without parallelism) vs. executed simultaneously on separate execution units (parallelism).

(30)

High Performance Computing 30

The “Holy Grail”

Write serial code and have a “smart” compiler capable of parallelizing programs automatically

It has been done in some very specific cases

In practice, no compiler proved to be “smart” enough

Writing efficient parallel code requires that the programmer understands, and makes explicit use of, the underlying hardware

Here we come

(31)

High Performance Computing 31

Parallel programming is difficult

Serial version: ~49 lines of C++ code

Parallel version: ~1000 lines of C/C++ code

http://www.moreno.marzolla.name/software/svmcell/

(32)

High Performance Computing 32

Issues of parallel programming

Writing parallel programs is in general much harder than writing sequential code

There is limited portability across different types of architectures

E.g., a distributed-memory parallel program must be rewritten from scratch to run on a GPU

However, there are standards (OpenMP, MPI, OpenCL) that allow portability across the same type of parallel architecture

Tuning for best performance is time-consuming

(33)

Intro to Parallel Programming 33

Example

Sum-reduction of an array

("Hello, world!" of parallel programming)

(34)

Intro to Parallel Programming 34

Sum-Reduction

We start assuming a shared-memory architecture

All execution units share a common memory space

We begin with a sequential solution and parallelize it

This is not always a good idea; some parallel algorithms have nothing in common with their sequential counterparts!

However, it is sometimes a reasonable starting point

Credits: Mary Hall

(35)

Intro to Parallel Programming 35

Sequential algorithm

Compute the sum of the content of an array A of length n

float seq_sum(float* A, int n) {
    int i;
    float sum = 0.0;
    for (i=0; i<n; i++) {
        sum += A[i];
    }
    return sum;
}

Credits: Mary Hall
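For reference, here is a complete, compilable version of the same sequential kernel with a tiny driver; the input values are illustrative (they are the ones that appear in the reduction figures later in this deck):

#include <stdio.h>

/* same function as in the slide above */
float seq_sum(float* A, int n) {
    int i;
    float sum = 0.0;
    for (i=0; i<n; i++) {
        sum += A[i];
    }
    return sum;
}

int main( void )
{
    float A[] = {1, 3, -2, 7, -6, 5, 3, 4};  /* illustrative input */
    printf("sum = %f\n", seq_sum(A, 8));     /* prints sum = 15.000000 */
    return 0;
}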

(36)

Intro to Parallel Programming 36

Version 1 (Wrong!)

Assuming P execution units (e.g., processors), each one computes a partial sum of n / P adjacent elements

Example: n = 15, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    sum += my_x;
}

Figure: the array (n = 15) split into three blocks of adjacent elements, one per processor (Proc 0, Proc 1, Proc 2). WRONG!

Variables whose names start with my_ are assumed to be local (private) to each processor; all other variables are assumed to be global (shared).

Race condition!
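To see why this is wrong, note that sum += my_x is a read-modify-write sequence, not an atomic operation. One possible interleaving (an illustrative sketch; x0 and x1 denote the values being added by Proc 0 and Proc 1):

Proc 0: reads sum                  /* sees 0.0                      */
Proc 1: reads sum                  /* also sees 0.0                 */
Proc 0: writes sum = 0.0 + x0
Proc 1: writes sum = 0.0 + x1      /* Proc 0's contribution is lost */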

(37)

Intro to Parallel Programming 37

Version 1 (better, but still wrong)

Assuming P processors, each one computes a partial sum of n / P adjacent elements

Example: n = 15, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}

WRONG

Proc 0 Proc 1 Proc 2

(38)

Intro to Parallel Programming 38

Version 1 (better, but still wrong)

Assuming P processors, each one computes a partial sum of n / P adjacent elements

Example: n = 17, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}

Figure: with n = 17 and P = 3, each processor gets n/P = 5 elements (integer division), so the last two elements of the array ("?? ??") are not assigned to any processor. WRONG!

(39)

Intro to Parallel Programming 39

Version 1 (correct, but not efficient)

Assuming P processors, each one computes a partial sum of n / P adjacent elements

Example: n = 17, P = 3

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0;
mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}

Proc 0 Proc 1 Proc 2
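A quick check of this partitioning with n = 17 and P = 3 (integer division): Proc 0 gets [17·0/3, 17·1/3) = [0, 5), Proc 1 gets [5, 11), Proc 2 gets [11, 17). Every element is assigned to exactly one processor, and the block sizes (5, 6, 6) differ by at most one.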

(40)

Intro to Parallel Programming 40

Version 2

Too much contention on mutex m

Each processor acquires and releases the mutex for each element of the array!

Solution: use the mutex at a coarser granularity

Each processor accumulates its partial sum in a local (private) variable

The mutex is used at the end to update the global sum

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0;
my_sum = 0.0;
mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    my_sum += my_x;
}
mutex_lock(&m);
sum += my_sum;
mutex_unlock(&m);
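As a concrete illustration of Version 2 on a real shared-memory API, here is a minimal sketch using POSIX threads (this is not the course's reference solution; the thread count, array size and dummy data are made up for the example):

#include <stdio.h>
#include <pthread.h>

#define P 4                 /* number of threads (illustrative) */
#define N 1000000           /* array length (illustrative)      */

float A[N];                                     /* shared input array */
float sum = 0.0;                                /* shared global sum  */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;  /* the mutex          */

void *worker(void *arg)
{
    int my_id = (int)(long)arg;
    int my_start = (long)N * my_id / P;         /* block partitioning as in Version 1 (correct) */
    int my_end   = (long)N * (my_id + 1) / P;
    float my_sum = 0.0;                         /* private partial sum */
    for (int my_i = my_start; my_i < my_end; my_i++) {
        my_sum += A[my_i];
    }
    pthread_mutex_lock(&m);                     /* one critical section per thread */
    sum += my_sum;
    pthread_mutex_unlock(&m);
    return NULL;
}

int main( void )
{
    pthread_t tid[P];
    for (int i = 0; i < N; i++) A[i] = 1.0;     /* dummy data */
    for (int i = 0; i < P; i++) pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    for (int i = 0; i < P; i++) pthread_join(tid[i], NULL);
    printf("sum = %f (expected %f)\n", sum, (float)N);
    return 0;
}

Compile with something like gcc -O2 -pthread.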

(41)

Intro to Parallel Programming 41

Version 3: Remove the mutex (wrong, in a subtle way)

We use a shared array psum[] where each processor can store its local sum

At the end, one processor computes the global sum

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0; /* all elements set to 0.0 */
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
if ( 0 == my_id ) { /* only the master executes this */
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++) sum += psum[my_i];
}

WRONG

(42)

Intro to Parallel Programming 42

The problem with version 3

Processor 0 could start the computation of the global sum before all other processors have computed the local sums!

Figure: execution timeline of P0, P1, P2, P3: each computes its local sum, then the global sum is computed; nothing forces P0 to wait for the others.

(43)

Intro to Parallel Programming 43

Version 4 (correct)

Use a barrier synchronization

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
barrier();
if ( 0 == my_id ) {
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++) sum += psum[my_i];
}

Figure: execution timeline of P0, P1, P2, P3: all processors compute their local sums, synchronize at barrier(), and only then does P0 compute the global sum.
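Since the practical part of this module uses C/OpenMP, here is a minimal OpenMP sketch of the same sum-reduction (array size and contents are illustrative). It relies on OpenMP's reduction clause instead of the explicit psum[] array and barrier() used above: each thread accumulates into a private copy of sum and the runtime combines the copies at the end of the parallel loop.

#include <stdio.h>

#define N 1000000           /* array length (illustrative) */

float A[N];

int main( void )
{
    float sum = 0.0;
    int i;

    for (i = 0; i < N; i++) A[i] = 1.0;   /* dummy data */

    /* each thread accumulates into a private copy of sum;
       the copies are summed together when the loop ends */
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++) {
        sum += A[i];
    }

    printf("sum = %f (expected %f)\n", sum, (float)N);
    return 0;
}

Compile with something like gcc -O2 -fopenmp.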

(44)

Intro to Parallel Programming 44

Version 5

Distributed-memory version

P << n processors

Each processor computes a local sum

Each processor sends the local sum to processor 0 (the master)

...
my_sum = 0.0;
my_start = …, my_end = …;
for ( i = my_start; i < my_end; i++ ) {
    my_sum += get_value(i);
}
if ( 0 == my_id ) {
    for ( i=1; i<P; i++ ) {
        tmp = receive from proc i;
        my_sum += tmp;
    }
    printf("The sum is %f\n", my_sum);
} else {
    send my_sum to thread 0;
}

(45)

Intro to Parallel Programming 45

Version 5

Figure: the array A[ ] = {1, 3, -2, 7, -6, 5, 3, 4} is distributed among Proc 0 … Proc 7; every processor sends its local sum to Proc 0, which accumulates the partial results one message at a time (my_sum = 4, 2, 9, 3, 8, 11, 15). Proc 0 is a bottleneck.

(46)

Intro to Parallel Programming 46

Parallel reduction

Figure: tree-structured reduction of A[ ] = {1, 3, -2, 7, -6, 5, 3, 4} across Proc 0 … Proc 7: adjacent pairs are summed in parallel (4, 5, -1, 7), then again (9, 6), until Proc 0 obtains the total 15.

(P – 1) sums are still performed; however, processor 0 receives ~ log2 P messages and performs ~ log2 P sums
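In practice, this tree-structured combination rarely needs to be coded by hand: MPI, for instance, offers collective reductions that are typically implemented with a tree-like communication pattern. As a hedged sketch, the whole send/receive block of the MPI example above could be replaced by a single call:

float sum = 0.0;
/* combine every process's my_sum with '+' and leave the result on rank 0 */
MPI_Reduce(&my_sum, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
if ( 0 == my_id ) printf("The sum is %f\n", sum);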

(47)

High Performance Computing 47

Task parallelism vs Data parallelism

Task Parallelism

Distribute (possibly different) tasks to processors

Data Parallelism

Distribute data to processors

Each processor executes the same task on different data

(48)

High Performance Computing 48

Example

We have a table containing hourly temperatures at some location

24 columns, 365 rows

Compute the minimum, maximum and average temperature for each day

Assume we have 3 independent processors

(49)

High Performance Computing 49

Example

Figure: the table of temperatures, with 24 columns (Hour 0—23) and 365 rows (Days 0—364); for each day we must produce max, min and ave.

(50)

High Performance Computing 50

Data parallel approach

Figure (data parallel): the rows (Days 0—364) of the table are split into three blocks; Proc 0, Proc 1 and Proc 2 each compute max, min and ave for their own block of days.

(51)

High Performance Computing 51

Task parallel approach

Figure (task parallel): the three tasks are split among the processors: Proc 0 computes max, Proc 1 computes min, Proc 2 computes ave; each scans all the days.
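A minimal C/OpenMP sketch of the two decompositions (table dimensions come from the example; the temperature values are dummy data, and the pragmas are only one possible way to express each approach):

#include <stdio.h>
#include <float.h>

#define DAYS  365
#define HOURS 24

float t[DAYS][HOURS];                       /* hourly temperatures (dummy data) */
float tmax[DAYS], tmin[DAYS], tave[DAYS];   /* per-day results                  */

int main( void )
{
    for (int d = 0; d < DAYS; d++)
        for (int h = 0; h < HOURS; h++)
            t[d][h] = 10.0 + h;             /* illustrative values */

    /* Data-parallel approach: the days are distributed among the threads;
       every thread computes max, min and ave for its own block of days. */
#pragma omp parallel for
    for (int d = 0; d < DAYS; d++) {
        float mx = -FLT_MAX, mn = FLT_MAX, s = 0.0;
        for (int h = 0; h < HOURS; h++) {
            if (t[d][h] > mx) mx = t[d][h];
            if (t[d][h] < mn) mn = t[d][h];
            s += t[d][h];
        }
        tmax[d] = mx; tmin[d] = mn; tave[d] = s / HOURS;
    }

    /* Task-parallel approach: one section per statistic; each section scans
       the whole table (it recomputes the same results, just to show the
       alternative decomposition). */
#pragma omp parallel sections
    {
#pragma omp section
        for (int d = 0; d < DAYS; d++) {
            float mx = -FLT_MAX;
            for (int h = 0; h < HOURS; h++) if (t[d][h] > mx) mx = t[d][h];
            tmax[d] = mx;
        }
#pragma omp section
        for (int d = 0; d < DAYS; d++) {
            float mn = FLT_MAX;
            for (int h = 0; h < HOURS; h++) if (t[d][h] < mn) mn = t[d][h];
            tmin[d] = mn;
        }
#pragma omp section
        for (int d = 0; d < DAYS; d++) {
            float s = 0.0;
            for (int h = 0; h < HOURS; h++) s += t[d][h];
            tave[d] = s / HOURS;
        }
    }

    printf("day 0: max=%.1f min=%.1f ave=%.1f\n", tmax[0], tmin[0], tave[0]);
    return 0;
}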

(52)

High Performance Computing 52

Key concepts

Parallel architectures “naturally” derive from physics laws

Parallel architectures require parallel programming paradigms

Writing parallel programs is much harder than writing sequential programs
