• Non ci sono risultati.

Performance evaluation of Performance evaluation of parallel programs parallel programs

N/A
N/A
Protected

Academic year: 2021

Condividi "Performance evaluation of Performance evaluation of parallel programs parallel programs"

Copied!
34
0
0

Testo completo

(1)

Performance evaluation of Performance evaluation of

parallel programs parallel programs

Moreno Marzolla

Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna

[email protected] Pacheco, Section 2.6

(2)
(3)

Performance Evaluation 3

Scalability

How much faster can a given problem be solved with p workers instead of one?

How much more work can be done with p workers instead of one?

What impact for the communication requirements of the parallel application have on performance?

What fraction of the resources is actually used productively for solving the problem?

Slide credit: prof. David Padua

(4)

Speedup

Let us define:

p = Number of processors / cores

T

serial

= Execution time of the serial program

T

parallel

(p) = Execution time of the parallel program with p

processors / cores

(5)

Performance Evaluation 5

Speedup

Speedup S(p)

In the ideal case, the parallel program requires 1/p the time of the sequential program

S(p) = p is the ideal case of linear speedup

Realistically, S(p) ≤ p

Is it possible to observe S(p) > p ?

S ( p) = T

serial

T

parallel

( p)T

parallel

( 1)

T

parallel

( p)

(6)

Warning

Never use a serial program to compute T

serial

If you do that, you might see a spurious superlinear speedup that is not there

Always use the parallel program with p = 1 processors

(7)

Performance Evaluation 7

Non-parallelizable portions

Suppose that a fraction α of the total execution time of the serial program can not be parallelized

E.g., due to:

Algorithmic limitations (data dependencies)

Bottlenecks (e.g., shared resources)

Startup overhead

Communication costs

Suppose that the remaining fraction (1 - α) can be fully parallelized

Then, we have:

T

parallel

( p) = α T

serial

+ ( 1−α)T

serial

p

(8)

Example

Intrinsecally serial part

Parallelizable part

p = 1 p = 2 p = 4

Tserial α Tserial(1- α) Tserial

T

parallel

( p)=α T

serial

+ ( 1−α)T

serial

p

Tparallel(2) Tparallel(4)

(9)

Performance Evaluation 9

Example

Suppose that a program has T

serial

= 20s

Assume that 10% of the time is spent in a serial portion of the program

Therefore, the execution time of a parallel version with p processors is

T

parallel

( p) = 0.1T

serial

+ 0.9 T

serial

p = 2+ 18

p

(10)

Example (cont.)

The speedup is

What is the maximum speedup that can be achieved when p → +∞ ?

S ( p) = T

serial

0.1×T

serial

+ 0.9×T

serial

p

= 20 2+ 18

p

(11)

Performance Evaluation 11

S ( p) = T

serial

T

parallel

( p)

= T

serial

α T

serial

+ ( 1−α)T

serial

p

= 1

α+ 1−α p

Amdahl's Law

What is the maximum speedup?

Gene Myron Amdahl (1922—)

Amdahl's Law

(12)

Amdahl's Law

From

we get an asymptotic speedup 1 / α when p grows to infinity

S ( p) = 1

α+ ( 1−α) p

If a fraction α of the total execution time is spent on a serial portion of the program, then the

maximum achievable speedup is 1/α

(13)
(14)

Scaling Efficiency

Strong Scaling: increase the number of processors p keeping the total problem size fixed

The total amount of work remains constant

The amount of work for a single processor decreases as p increases

Goal: reduce the total execution time by adding more processors

Weak Scaling: increase the number of processors p keeping the per-processor work fixed

The total amount of work grows as p increases

Goal: solve larger problems within the same amount of time

(15)

Performance Evaluation 15

Strong Scaling Efficiency

E(p) = Strong Scaling Efficiency

where

T

parallel

(p) = Execution time of the parallel program with p

processors / cores

E ( p) = S ( p)

p = T

parallel

( 1)

p×T

parallel

( p)

(16)

Strong Scaling Efficiency and Amdahl's Law

(17)

Performance Evaluation 17

Weak Scaling Efficiency

W(p) = Weak Scaling Efficiency

where

T

1

= time required to complete 1 work unit with 1 processor

T

p

= time required to complete p work units with p processors

W ( p)= T

1

T

p

(18)

Example

See omp-matmul.c

demo-strong-scaling.sh computes the times required for strong scaling (demo-strong-scaling.ods)

demo-weak-scaling.sh computes the times required for weak scaling (demo-weak-scaling.ods)

Important note

For a given n, the amount of “work” required to compute the product of two n ´ n matrices is ~ n

3

To double the total work, do not double n!

The amount of work required to compute the product of two (2n ´ 2n) matrices is ~ 8n3, eight times the n ´ n case

(19)

Performance Evaluation 19

Taking times

To compute speedup and efficiencies one needs to take the program wall clock time, excluding the

input/output time

#include "hpc.h"

...

double start, finish;

/* read input data; this should not be measured */

start = hpc_gettime();

/* actual code that we want to time */

finish = hpc_gettime();

printf(“Elapsed time = %e seconds\n”, finish – start);

/* write output data; this should not be measured */

The actual computation is done here

(20)

Post Scriptum:

How to present results

(21)

Performance Evaluation 21

Basic rules for graphing data

Understanding the data should require the least possible effort from the reader

E.g., use names instead of symbols for the legend when possible

Provide enough information to make the graph self- contained

Stick to widely accepted practices

The origin of plots should be (0, 0)

The effect should be on the y axis, the cause on the x axis

(22)

Stick to accepted practices

Bad

(23)

Performance Evaluation 23

Do not misuse axis origins

Playing with the axis ranges allows one to emphasize small differences

2011 2012 2013 2014 2015

1740 1760 1780 1800 1820 1840 1860 1880

Year

Sales

2011 2012 2013 2014 2015

0 200 400 600 800 1000 1200 1400 1600 1800 2000

Year

Sales

Bad Good

(24)
(25)

Performance Evaluation 25

Do not misuse axis origins

a change from 0.6% to 0.7% according to the Daily Mail

http://www.dailymail.co.uk/news/article-4248690/Economy-grew-0-7-final-three-months-2016.html

Bad

(26)

Do not misuse axis origins

a change from 0.6% to 0.7% with proper scale

First estimate Second estimate

2916 Q4 growth upgraded

Good

(27)

Performance Evaluation 27

Prefer names to symbols

m=1

m=2

m=3

1 job/sec

2 job/sec 3 job/sec

Bad Good

(28)

Show all data

?

Bad

?

?

(29)

Performance Evaluation 29

Do not overlap related data

Wall clock time Speedup

Num. of processors

Bad

(30)

Do not connect points when intermediate values can not exist

MIPS

P4 Core i3 ARM Core i7

Bad

MIPS

P4 Core i3 ARM Core i7

Better

(31)

Performance Evaluation 31

Do not use graphs when words are enough

123.45

Bad

I have actually seen this in a thesis

(32)

Other suggestions

Always add a (numbered) caption to each figure

If possible, the caption should contain enough information to understand the figure without reading the main text

All figures must be referenced and described in the

main text of your document

(33)

Performance Evaluation 33

“The best statistical graphic ever drawn”

(Edward Tufte)

Modern redrawing of Charles Minard's map of Napoleon's disastrous Russian campaign of 1812.

The graphic is notable for its representation in two dimensions of six types of data: the number of Napoleon's troops; distance; temperature; the latitude and longitude; direction of travel; and location relative to specific dates (Image and description from Wikipedia)

(34)

Suggested reading

Riferimenti

Documenti correlati

● Strong Scaling: increase the number of processors p keeping the total problem size fixed.. – The total amount of work

● Strong Scaling: increase the number of processors p keeping the total problem size fixed. – The total amount of work

Irrigation in the higher plain plays an important and positive role on groundwater quantity by increasing recharge over natural rates, and on water quality by diluting

Measurements of unsaturated hydraulic conductivity were grouped into different classes of soil rehabilitation, soil compaction and vegeta- tion cover to investigate the effects of

The measured PM 2.5 mass was reconstructed using a multiple-site hybrid chemical mass closure approach that also accounts for aerosol inorganic water content (AWC) estimated by

Finally, merging these data with the total iron measurements collected using inductively coupled plasma time of flight mass spectrometry (ICP-TOF) analysis, we present the first

Monthly (a) and weekly-hourly (b) average patterns of PNSDs calculated over all the data.. subtracting the concentration modeled by the STL of bin i at the month j-1. The STL

The methodology was tested on the case study of outdoor recreation in the Alps, where recreational values were aggregated prior to any spa- tial autocorrelation analysis, to address