Architectures and Platforms for AI module 2
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna
http://www.moreno.marzolla.name/
Pacheco, Chapter 1
High Performance Computing 2
Credits
● prof. Salvatore Orlando, Univ. Ca' Foscari di Venezia http://www.dsi.unive.it/~orlando/
● prof. Mary Hall, University of Utah https://www.cs.utah.edu/~mhall/
● Tim Mattson, Intel
Who I am
● Moreno Marzolla
  – Associate professor @ DISI (Cesena campus)
  – https://www.moreno.marzolla.name/
● Recent teaching activity
  – High Performance Computing @ ISI
  – Foundations of Computer Science
  – Algorithms and Data Structures
● Research activity
  – Parallel programming
  – Modeling and simulation
Architectures and Platforms for AI module 2
● Web page
  – https://www.moreno.marzolla.name/teaching/APAI
● Schedule
  – Monday 11:00–13:30, Room 0.6
  – Check the Web page and the official course timetable for variations
● Office hours
  – On demand via Teams; please send e-mail
Syllabus
● 2 CFU (~24 hours of lectures)
● Theory
  – Parallel programming patterns
  – Performance evaluation of parallel programs
● Parallel programming practice
  – Shared-memory programming with C/OpenMP
  – GPU programming with CUDA/C
References
● Peter S. Pacheco, An Introduction to Parallel Programming, Morgan Kaufmann 2011, ISBN 9780123742605
  – Theory + C/OpenMP programming
● CUDA C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/
  – CUDA/C programming
● See the Web page for slides and links to online material: https://www.moreno.marzolla.name/
Prerequisites
● Mandatory
  – You must be good at C programming
● Strongly suggested
  – (Foundations of) Operating Systems
  – (Foundations of) Computer Architectures
  – (Foundations of) Algorithms and Data Structures
Hardware resources
● isi-raptor03.csr.unibo.it
  – Dual-socket Xeon, 12 cores, 64 GB RAM, Ubuntu 16.04
  – 3× NVidia GeForce GTX 1070
Exam
● Individual programming project + written report
  – Specification defined by the instructor
  – One project for the whole academic year
  – The project can be handed in at any time
  – There is no oral discussion, unless I need explanations
● Final grade
  – Weighted average of modules 1+3 (weight 4) and module 2 (weight 2)
  – Rounded to the nearest integer
  – Honors ("lode") assigned by the instructors for outstanding work
● Grades remain valid until Sep 30, 2021
  – After that, a new edition of this course starts
Grading the programming project
● Correctness
● Clarity
● Efficiency
● Quality of the written report
  – Proper grammar, syntax, ...
  – Technical correctness
  – Performance evaluation
Questions?
Intro to parallel programming
High Performance Computing
● Many applications need considerable computing power
  – Weather forecasting, climate modeling, physics simulation, product engineering, 3D animation, finance, ...
● Why?
  – To solve more complex problems
  – To solve the same problem in less time
  – To solve the same problem more accurately
  – To make better use of available computing resources
Parallel programming
● "Traditional" scientific paradigm
  – Make a theory, then experiment
● "Traditional" engineering paradigm
  – Design, then build
● Enter numerical experimentation and prototyping
  – Some phenomena are too complex to be modeled accurately (e.g., weather forecasting)
  – Some experiments are too complex, costly, dangerous, or impossible to do in the lab (e.g., wind tunnels, seismic simulations, stellar dynamics, ...)
● Computational science
  – Numerical simulations are becoming a new way to "do science"
Slide credits: S. Orlando
Applications: Neural Networks
Applications: Molecular dynamics
Intro to Parallel Programming 18
Applications: Cosmological Simulation
● Bolshoi simulation, https://vimeo.com/29769051
  – The Bolshoi Simulation recreates the large-scale structure of the universe; it required 6 million CPU hours on NASA's Pleiades supercomputer
  – Source: https://www.nas.nasa.gov/hecc/resources/pleiades.html
Moore's Law
"The number of transistors on an IC doubles every 24 months"
● That used to mean that every new generation of processors was based on smaller transistors
Moore, G.E., Cramming more components onto integrated circuits. Electronics, 38(8), April 1965
Gordon E. Moore (1929– )
[Chart: transistor count over time, log scale. By Wgsimon, own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15193542]
Physics lesson
● Smaller transistors → Faster processor
● Faster processor → Higher power consumption
● Higher power consumption → More heat produced
● More heat produced → Unreliable processor
Power
● The power required by an IC (e.g., a processor) can be expressed as
  Power = C × V² × f
  where:
  – C is the capacitance (ability of a circuit to store energy)
  – V is the voltage
  – f is the frequency at which the processor operates
Power
● Single processor: capacitance C, voltage V, frequency f
  Power = C V² f
● Two processors, each running at frequency f / 2: capacitance 2.2 C, voltage 0.6 V, frequency 0.5 f
  Power = 2.2 C × (0.6 V)² × 0.5 f = 0.396 C V² f
Credits: Tim Mattson
[Chart: 42 years of microprocessor trend data. Source: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/]
Processor/Memory wall
Source: John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, Fifth Ed., Morgan Kaufmann 2012, ISBN 978-0-12-383872-8
Limits
● There are limits to the "automatic" improvement of scalar performance:
  – The Power Wall: clock frequency cannot be increased without exceeding the limits of air cooling
  – The Memory Wall: access to data is a limiting factor
  – The ILP Wall: all the existing instruction-level parallelism (ILP) is already being used
● Conclusion: explicit parallel mechanisms and explicit parallel programming are required for performance scaling
Slide credits: Hebenstreit, Reinders, Robison, McCool, SC13 tutorial
What happens today?
● HW designers create processors with more cores
● Result:
  – parallel hardware is ubiquitous
  – parallel software is rare
● The challenge: make parallel software as common as parallel hardware
[Pictured: NVidia Tegra 4 SoC]
Parallel programming in brief
● Decompose the problem into sub-problems
● Distribute sub-problems to the available execution units
● Solve sub-problems independently
  – Cooperate to solve sub-problems
● Goals
  – Reduce the wall-clock time
  – Balance the workload across execution units
  – Reduce communication and synchronization overhead
Slide credits: S. Orlando
Concurrency vs Parallelism
Slide credits: Tim Mattson, Intel
[Diagram: three tasks (Task 1–3) interleaved on a single execution unit (concurrency without parallelism) vs. executed simultaneously on separate units (parallelism)]
The “Holy Grail”
● Write serial code and have a "smart" compiler capable of parallelizing programs automatically
● It has been done in some very specific cases
  – In practice, no compiler has proved to be "smart" enough
● Writing efficient parallel code requires that the programmer understands, and makes explicit use of, the underlying hardware
Here we come
Parallel programming is difficult
● Serial version: ~49 lines of C++ code
● Parallel version: ~1000 lines of C/C++ code
http://www.moreno.marzolla.name/software/svmcell/
Issues of parallel programming
● Writing parallel programs is in general much harder than writing sequential code
● There is limited portability across different types of architectures
  – E.g., a distributed-memory parallel program must be rewritten from scratch to run on a GPU
  – However, there are standards (OpenMP, MPI, OpenCL) that allow portability across the same type of parallel architecture
● Tuning for best performance is time-consuming
Example
Sum-reduction of an array
("Hello, world!" of parallel programming)
Sum-Reduction
● We start by assuming a shared-memory architecture
  – All execution units share a common memory space
● We begin with a sequential solution and parallelize it
  – This is not always a good idea; some parallel algorithms have nothing in common with their sequential counterparts!
  – However, it is sometimes a reasonable starting point
Credits: Mary Hall
Sequential algorithm
● Compute the sum of the content of an array A of length n

float seq_sum(float* A, int n) {
    int i;
    float sum = 0.0;
    for (i=0; i<n; i++) {
        sum += A[i];
    }
    return sum;
}
Credits: Mary Hall
Version 1 (Wrong!)
● Assuming P execution units (e.g., processors), each one computes a partial sum of n / P adjacent elements
● Example: n = 15, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    sum += my_x;
}

WRONG: race condition on the shared variable sum!
Variables whose names start with my_ are assumed to be local (private) to each processor; all other variables are assumed to be global (shared).
Version 1 (better, but still wrong)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 15, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}

STILL WRONG
Version 1 (better, but still wrong)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 17, P = 3

my_block_len = n/P;
my_start = my_id * my_block_len;
my_end = my_start + my_block_len;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}

WRONG: when n is not a multiple of P (here n/P = 5), the last n mod P elements are not assigned to any processor
Version 1 (correct, but not efficient)
● Assuming P processors, each one computes a partial sum of n / P adjacent elements
● Example: n = 17, P = 3

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    mutex_lock(&m);
    sum += my_x;
    mutex_unlock(&m);
}
Version 2
● Too much contention on mutex m
  – Each processor acquires and releases the mutex for each element of the array!
● Solution: increase the mutex granularity
  – Each processor accumulates the partial sum in a local (private) variable
  – The mutex is used only at the end to update the global sum

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
sum = 0.0; my_sum = 0.0; mutex m;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    my_sum += my_x;
}
mutex_lock(&m);
sum += my_sum;
mutex_unlock(&m);
Version 3: remove the mutex (wrong, in a subtle way)
● We use a shared array psum[] where each processor can store its local sum
● At the end, one processor computes the global sum

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0; /* all elements set to 0.0 */
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
if ( 0 == my_id ) { /* only the master executes this */
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}

WRONG
The problem with version 3
● Processor 0 could start the computation of the global sum before all other processors have computed their local sums!
[Diagram: P0 begins the global sum while P1–P3 are still computing their local sums]
Version 4 (correct)
● Use a barrier synchronization

my_start = n * my_id / P;
my_end = n * (my_id + 1) / P;
psum[0..P-1] = 0.0;
for (my_i=my_start; my_i<my_end; my_i++) {
    my_x = get_value(my_i);
    psum[my_id] += my_x;
}
barrier();
if ( 0 == my_id ) {
    sum = 0.0;
    for (my_i=0; my_i<P; my_i++)
        sum += psum[my_i];
}

[Diagram: all processors compute their local sums, wait at barrier(), then P0 computes the global sum]
Version 5
Distributed-memory version
● P << n processors
● Each processor computes a local sum
● Each processor sends its local sum to processor 0 (the master)

my_sum = 0.0;
my_start = …; my_end = …;
for ( i = my_start; i < my_end; i++ ) {
    my_sum += get_value(i);
}
if ( 0 == my_id ) {
    for ( i=1; i<P; i++ ) {
        tmp = receive from proc i;
        my_sum += tmp;
    }
    printf("The sum is %f\n", my_sum);
} else {
    send my_sum to proc 0;
}
Version 5
[Diagram: with A[] = {1, 3, -2, 7, -6, 5, 3, 4}, processors 1–7 each send their local sum to processor 0, which adds them one at a time; processor 0 is a bottleneck]
Parallel reduction
[Diagram: tree-structured reduction of A[] = {1, 3, -2, 7, -6, 5, 3, 4} across 8 processors; partial sums are combined pairwise at each step until processor 0 holds the total (15)]
● (P – 1) sums are still performed; however, processor 0 receives ~log2 P messages and performs ~log2 P sums
Task parallelism vs Data parallelism
● Task Parallelism
  – Distribute (possibly different) tasks to processors
● Data Parallelism
  – Distribute data to processors
  – Each processor executes the same task on different data
Example
● We have a table containing hourly temperatures at some location
  – 24 columns, 365 rows
● Compute the minimum, maximum and average temperature for each day
● Assume we have 3 independent processors
Example
[Diagram: a table with 365 rows (days 0–364) and 24 columns (hours 0–23), plus three per-day output columns: max, min, ave]
Data parallel approach
[Diagram: data-parallel decomposition: the 365 rows are split into three blocks of days; each processor computes max, min and ave for its own block]
Task parallel approach
[Diagram: task-parallel decomposition: Proc 0 computes max, Proc 1 computes min, Proc 2 computes ave, each over all 365 days]
Key concepts
● Parallel architectures "naturally" derive from the laws of physics
● Parallel architectures require parallel programming paradigms