Performance evaluation of Performance evaluation of
parallel programs parallel programs
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna
[email protected] Pacheco, Section 2.6
Performance Evaluation 3
Scalability
●
How much faster can a given problem be solved with p workers instead of one?
●
How much more work can be done with p workers instead of one?
●
What impact for the communication requirements of the parallel application have on performance?
●
What fraction of the resources is actually used productively for solving the problem?
Slide credit: prof. David Padua
Speedup
●
Let us define:
–
p = Number of processors / cores
–
T
serial= Execution time of the serial program
–
T
parallel(p) = Execution time of the parallel program with p
processors / cores
Performance Evaluation 5
Speedup
●
Speedup S(p)
●
In the ideal case, the parallel program requires 1/p the time of the sequential program
●
S(p) = p is the ideal case of linear speedup
–
Realistically, S(p) ≤ p
–
Is it possible to observe S(p) > p ?
S ( p) = T
serialT
parallel( p) ≈ T
parallel( 1)
T
parallel( p)
Warning
●
Never use a serial program to compute T
serial–
If you do that, you might see a spurious superlinear speedup that is not there
●
Always use the parallel program with p = 1 processors
Performance Evaluation 7
Non-parallelizable portions
●
Suppose that a fraction α of the total execution time of the serial program can not be parallelized
–
E.g., due to:
● Algorithmic limitations (data dependencies)
● Bottlenecks (e.g., shared resources)
● Startup overhead
● Communication costs
●
Suppose that the remaining fraction (1 - α) can be fully parallelized
●
Then, we have:
T
parallel( p) = α T
serial+ ( 1−α)T
serialp
Example
Intrinsecally serial part
Parallelizable part
p = 1 p = 2 p = 4
Tserial α Tserial(1- α) Tserial
T
parallel( p)=α T
serial+ ( 1−α)T
serialp
Tparallel(2) Tparallel(4)
Performance Evaluation 9
Example
●
Suppose that a program has T
serial= 20s
●
Assume that 10% of the time is spent in a serial portion of the program
●
Therefore, the execution time of a parallel version with p processors is
T
parallel( p) = 0.1T
serial+ 0.9 T
serialp = 2+ 18
p
Example (cont.)
●
The speedup is
●
What is the maximum speedup that can be achieved when p → +∞ ?
S ( p) = T
serial0.1×T
serial+ 0.9×T
serialp
= 20 2+ 18
p
Performance Evaluation 11
S ( p) = T
serialT
parallel( p)
= T
serialα T
serial+ ( 1−α)T
serialp
= 1
α+ 1−α p
Amdahl's Law
●
What is the maximum speedup?
Gene Myron Amdahl (1922—)
Amdahl's Law
Amdahl's Law
●
From
we get an asymptotic speedup 1 / α when p grows to infinity
S ( p) = 1
α+ ( 1−α) p
If a fraction α of the total execution time is spent on a serial portion of the program, then the
maximum achievable speedup is 1/α
Scaling Efficiency
●
Strong Scaling: increase the number of processors p keeping the total problem size fixed
–
The total amount of work remains constant
–
The amount of work for a single processor decreases as p increases
–
Goal: reduce the total execution time by adding more processors
●
Weak Scaling: increase the number of processors p keeping the per-processor work fixed
–
The total amount of work grows as p increases
Goal: solve larger problems within the same amount of time
Performance Evaluation 15
Strong Scaling Efficiency
●
E(p) = Strong Scaling Efficiency
●
where
–
T
parallel(p) = Execution time of the parallel program with p
processors / cores
E ( p) = S ( p)
p = T
parallel( 1)
p×T
parallel( p)
Strong Scaling Efficiency and Amdahl's Law
Performance Evaluation 17
Weak Scaling Efficiency
●
W(p) = Weak Scaling Efficiency
●
where
–
T
1= time required to complete 1 work unit with 1 processor
–
T
p= time required to complete p work units with p processors
W ( p)= T
1T
pExample
●
See omp-matmul.c
–
demo-strong-scaling.sh computes the times required for strong scaling (demo-strong-scaling.ods)
–
demo-weak-scaling.sh computes the times required for weak scaling (demo-weak-scaling.ods)
●
Important note
–
For a given n, the amount of “work” required to compute the product of two n ´ n matrices is ~ n
3–
To double the total work, do not double n!
● The amount of work required to compute the product of two (2n ´ 2n) matrices is ~ 8n3, eight times the n ´ n case
Performance Evaluation 19
Taking times
●
To compute speedup and efficiencies one needs to take the program wall clock time, excluding the
input/output time
#include "hpc.h"
...
double start, finish;
/* read input data; this should not be measured */
start = hpc_gettime();
/* actual code that we want to time */
finish = hpc_gettime();
printf(“Elapsed time = %e seconds\n”, finish – start);
/* write output data; this should not be measured */
The actual computation is done here
Post Scriptum:
How to present results
Performance Evaluation 21
Basic rules for graphing data
●
Understanding the data should require the least possible effort from the reader
–
E.g., use names instead of symbols for the legend when possible
●
Provide enough information to make the graph self- contained
●
Stick to widely accepted practices
–
The origin of plots should be (0, 0)
–
The effect should be on the y axis, the cause on the x axis
Stick to accepted practices
Bad
Performance Evaluation 23
Do not misuse axis origins
●
Playing with the axis ranges allows one to emphasize small differences
2011 2012 2013 2014 2015
1740 1760 1780 1800 1820 1840 1860 1880
Year
Sales
2011 2012 2013 2014 2015
0 200 400 600 800 1000 1200 1400 1600 1800 2000
Year
Sales
Bad Good
Performance Evaluation 25
Do not misuse axis origins
a change from 0.6% to 0.7% according to the Daily Mail
http://www.dailymail.co.uk/news/article-4248690/Economy-grew-0-7-final-three-months-2016.html
Bad
Do not misuse axis origins
a change from 0.6% to 0.7% with proper scale
First estimate Second estimate
2916 Q4 growth upgraded
Good
Performance Evaluation 27
Prefer names to symbols
m=1
m=2
m=3
1 job/sec
2 job/sec 3 job/sec
Bad Good
Show all data
?
Bad
??
Performance Evaluation 29
Do not overlap related data
Wall clock time Speedup
Num. of processors
Bad
Do not connect points when intermediate values can not exist
MIPS
P4 Core i3 ARM Core i7
Bad
MIPS
P4 Core i3 ARM Core i7
Better
Performance Evaluation 31
Do not use graphs when words are enough
123.45
Bad
I have actually seen this in a thesis
Other suggestions
●
Always add a (numbered) caption to each figure
–
If possible, the caption should contain enough information to understand the figure without reading the main text
●
All figures must be referenced and described in the
main text of your document
Performance Evaluation 33
“The best statistical graphic ever drawn”
(Edward Tufte)
Modern redrawing of Charles Minard's map of Napoleon's disastrous Russian campaign of 1812.
The graphic is notable for its representation in two dimensions of six types of data: the number of Napoleon's troops; distance; temperature; the latitude and longitude; direction of travel; and location relative to specific dates (Image and description from Wikipedia)