Optimization techniques
Carlo Cavazzoni, HPC department, CINECA
Modern node architecture
CPU
RAM Disk
cache I, D
Small & fast
Cache
Hierarchy register L1 L2 L3 RAM L1: Instruction and data
Size: L1 … Ln Speed: L1 … Ln
CPU looks for data in L1, if it is there (L1 cache hit), if not (L1 cache miss) and looks in L2 …
cache miss penaly in terms of clock cycle
CACHE Direct Mapped
32 Kbyte
32 Kbyte 32 Kbyte
32 Kbyte 32 Kbyte
0
32 K
64 K
128 K
cache
Cache set associative
32 Kbyte
32 Kbyte 16 Kbyte
32 Kbyte
0
32 K
64 K
128 K 16 Kbyte
16 K
48 K LastRecentlyUse
d
Round Robin
Random
Loop optimization
Loop fusion
Locality in time
do i=1, n
a(i) = b(i) + 1.0 enddo
do i=2, n
c(i) = sqrt(a(i-1)) enddo
do i=2, n
a(i) = b(i) + 1.0 c(i) = sqrt(a(i-1)) enddo
a(1) = b(1) + 1.0 if n is big enough, a is
loaded, offloaded and loaded again into cache
Reuse the a(i) loaded into
cache
Loop interchange
Locality in space
do i=1, n do j=1, n
a(i,j) = b(i,j) + 1.0 enddo
enddo
Load elements into cache lines and use only one before replacing them with new elements
Load elements into cache and use all of them before replacing them with new elements
do j=1, n do i=1, n
a(i,j) = b(i,j) + 1.0 enddo
enddo
0x000x01 0x02 0x03
a b
j j
i
i
Cache thrashing
real, dimension (1024) :: a,b COMMON /my_com/ a, b
do i=1, 1024
a(i) = b(i) + 1.0 enddo
offset shift matrixes w.r.t. cache no more problems
integer offset =
(linea_cache)/SIZE(REAL)
real, dimension (1024+offset) :: a,b COMMON /my_com/ a, b
do i=1, 1024
a(i) = b(i) + 1.0 enddo
Padding
size cache = 4*1024, direct mapped, a, b contiguous cache thrashing
array size = multiple of cache size
Loop unrolling
do j=1, n
do i=1, (n-1)
a(i,j)= b(i,j)+b(i+1,j)+1.0 enddo
enddo
Equivalent Loops.
Fewer jump.
Fewer dependencies.
Fill pipelines and vector units.
do j=1, n
do i=1, (n-1), 2
a(i,j) = b(i,j) +b(i+1,j)+1.0 a(i+1,j) = b(i+1,j)+b(i+2,j)+1.0 enddo
enddo
Optimize with
numerical libraries
Less coding
Tested and (almost) bug free Standard
Efficient implementation
Optimized
BLAS
Basic Linear Algebra Subprogram, Parallel BLAS and Basic Linear Algebra Communication Subsystem (www.netlib.org)
• Level 1 BLAS: Vector-Vector operations (scalar only).
• Level 2 BLAS, PBLAS: Vector-Matrix operations (scalar and parallel).
• Level 3 BLAS, PBLAS: Matrix-Matrix operation (scalar and parallel).
• Level 1 and 2 BLACS: vector reduction,
vector and matrics communications.
Lapack and Scalapack
Linear Algebra Package and Scalable Lapack (www.netlib.org)
Matrix decomposition.
Solution of Linear Systems.
Eigenvalues and Eigenvetors
Linear Least Square solutions
MKL ESSL ACML
CUBLAS
MAGMA
PLASMA
MASS (IBM)
• Accelerated version of SQRT, SIN, COS, EXP, LOG, ecc…
Scalar and vector
VML
Equivalent to MASS (vector version only) For Intel processors
Accelerated version of:
sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y
VML
do i = 1, n
r = r + sin( a(i) ) end do
call vdsin( n, a, y ) do i = 1, n
r = r + y( i ) end do
CALL vml_subroutine( n, a, y )
BLAS
Matrix multiplication
DGEMM (transa, transb, l, n, m, alpha, a, lda, b, ldb, beta, c, ldc)
c = alpha op( a ) * op( b ) + beta c
C
lm=
nA
lnB
nm+ C
lmC
lm=
nA
TlnB
nm+ C
lmC
lm=
nA
lnB
Tnm+ C
lmC
lm=
nA
TlnB
Tnm+ C
lmreal*8 a(lda,*), b(ldb,*), c(ldc,*)
Profileing with gprof
Compiler flag “-pg” or “-p” (depend on the compiler)
gcc -pg –c mio.c ./a.out
gmon.out gprof
gcc -pg -funroll-loops –O2 dotprod.c -static
[cineca@rfxoff1 Carlo]$ ./a.out d = 1000000.000000
% cumulative self self total
time seconds seconds calls us/call us/call name
68.57 0.05 0.05 2 23437.50 23437.50 set_vector
31.43 0.07 0.02 1 21484.38 21484.38 dot_product
0.00 0.07 0.00 1 0.00 68359.38 main
gprof
Profileing “by hand”
Find “hot spot” in your application Use temporization functions
CALL SYSTEM_CLOCK(iclk1, count_rate=nclk) CALL critical_subroutine( …… )
CALL SYSTEM_CLOCK(iclk2)
PRINT *,REAL(iclk2-iclk1)/nclk
t1 = cclock()
CALL critical_subroutine( …… ) t2 = cclock()
PRINT *, (t2-t1) CALL CPU_TIME( t3 )
CALL critical_subroutine( …… ) CALL CPU_TIME( t4 )
PRINT *, (t4-t3)
Mesure performances
#include<stdio.h>
#include<time.h>
#include<ctype.h>
#include<sys/types.h>
#include<sys/time.h>
double cclock_() {
/* Restituisce il valore del CLOCK di sistema in secondi */
struct timeval tmp;
double sec;
gettimeofday( &tmp, (struct timezone *)0 );
sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;
return sec;
}
PROGRAM test_dgemm IMPLICIT NONE
INTEGER, PARAMETER :: dim = 1000
REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:) INTEGER :: i,j,k
REAL*8 :: t1, t2 REAL*8 :: cclock EXTERNAL :: cclock
ALLOCATE( x( dim, dim ), y( dim, dim ) ) ALLOCATE( z( dim, dim ) )
y = 1.0d0
z = 1.0d0 / DBLE( dim ) x = 0.0d0
t1 = cclock( ) do j = 1, dim do i = 1, dim do k = 1, dim
x(i,j) = x(i,j) + y(i,k) * z(k,j) end do
end do end do
t2 = cclock()
write(*,*) ' Matrix sum = ', sum(x) write(*,*) ' tempo (secondi) ', t2-t1
PROGRAM test_dgemm IMPLICIT NONE
INTEGER, PARAMETER :: dim = 1000
REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:) INTEGER :: i,j,k
REAL*8 :: t1, t2 REAL*8 :: cclock EXTERNAL :: cclock
ALLOCATE( x( dim, dim ), y( dim, dim ), z( dim, dim ) ) y = 1.0d0
z = 1.0d0 / DBLE( dim ) x = 0.0d0
t1 = cclock()
! x = matmul( y, z )
call dgemm('N', 'N', dim, dim, dim, 1.0d0, y, c dim, z, dim,0.0d0, x, dim)
t2 = cclock()
write(*,*) ' Matrix sum = ', sum(x) write(*,*) ' tempo (secondi) ', t2-t1
ATLAS
BLAS compatible
Automatically Tuned Linear Algebra Software http://sourceforge.net/
http://math-atlas.sourceforge.net/devel/
http://www.fftw.org Fast Fourier Trasform
FFT complex to complex FFT complex to real
Advanced techniques
do i=1,n do j=1,m
y(j,i) = x(i,j) enddo
enddo
Case Study: matrix transposition
y
x
Suppose 2-way set associative
y x
For each value of x I need to load into cache a whole line.
1) X “allocate” the 2nd way.
2) Risk of thrashing
3) When the cache is full, the proc. Start to overwrite cache lines
data mapped in cache y “allocate” the 1st way
What happens with the cache
Suppose 2-way set associative
y x
As before for each value of x I need to load into cache a whole cache line.
data mapped in cache y “allocate” the 1st way
What happens with the cache, cont.
Suppose 2-way set associative
y x
Block Algorithm
Load a block of data into cache
Swap data in cache Write data back to memory
do i=1,n do j=1,m
y(j,i) = x(i,j) enddo
enddo
do ib = 1, nb
ioff = (ib-1) * bsiz do jb = 1, mb
joff = (jb-1) * bsiz do j = 1, bsiz
do i = 1, bsiz
buf(i,j) = x(i+ioff, j+joff) enddo
enddo
do j = 1, bsiz do i = 1, j-1 bswp = buf(i,j)
buf(i,j) = buf(j,i) buf(j,i) = bswp
enddo enddo
do i=1,bsiz do j=1,bsiz
y(j+joff, i+ioff) = buf(j,i) enddo
enddo enddo bsiz = block size
nb = n / bsiz mb = m / bsiz You need to handle:
MOD(n / bsiz) /= 0 OR MOD(m / bsiz) /= 0
Solution: Block algorithm
Whole block transpose
do ib = 1, nb
ioff = (ib-1) * bsiz do jb = 1, mb
joff = (jb-1) * bsiz do j = 1, bsiz
do i = 1, bsiz
buf(i,j) = x(i+ioff,j+joff) enddo
enddo
do j = 1, bsiz do i = 1, j-1 bswp = buf(i,j) buf(i,j) = buf(j,i) buf(j,i) = bswp enddo
enddo
do i=1,bsiz do j=1,bsiz
y(j+joff,i+ioff) = buf(j,i) enddo
enddo enddo
enddo
IF( min( 1, MOD(n,bsiz) ) .GT. 0 ) THEN ioff = nb * bsiz
do jb = 1, mb
joff = (jb-1) * bsiz do j = 1, bsiz
do i = 1, MIN(bsiz, n-ioff) buf(i,j) = x(i+ioff, j+joff) enddo
enddo
do i = 1, MIN(bsiz, n-ioff) do j = 1, bsiz
y(j+joff,i+ioff) = buf(i,j) enddo
enddo enddo
END IF
IF( MIN(1, MOD(m, bsiz)) .GT. 0 ) THEN joff = mb * bsiz
do ib = 1, nb
ioff = (ib-1) * bsiz
do j = 1, MIN(bsiz, m-joff) do i = 1, bsiz
buf(i,j) = x(i+ioff, j+joff) enddo
enddo
do i = 1, bsiz
do j = 1, MIN(bsiz, m-joff) y(j+joff,i+ioff) = buf(i,j) enddo
enddo enddo END IF
IF( MIN(1,MOD(n,bsiz)).GT.0 .AND. &
& MIN(1,MOD(m,bsiz)).GT.0 ) THEN joff = mb * bsiz
ioff = nb * bsiz
do j = 1, MIN(bsiz, m-joff) do i = 1, MIN(bsiz, n-ioff) buf(i,j) = x(i+ioff, j+joff) enddo
enddo
do i = 1, MIN(bsiz, n-ioff) do j = 1, MIN(bsiz, m-joff) y(j+joff,i+ioff) = buf(i,j) enddo
enddo
1
2
3
Performance tuning and analysis: user codes
Matrix Trasposition Matrix size: 2048x2048
0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50
0 20 40 60 80 100 120
execution time
Straightforward implementation
Block
implementation
DO l=1,nphase
IF(au1(l,l) /= 0.D0) THEN lp1=l+1
div=1.D0/au1(l,l) DO lj=lp1,nphase
au1(l,lj)=au1(l,lj)*div END DO
bu1(l)=bu1(l)*div au1(l,l)=0.D0 DO li=1,nphase amul=au1(li,l) DO lj=lp1,nphase
au1(li,lj)=au1(li,lj)-amul*au1(l,lj) END DO
bu1(li)=bu1(li)-amul*bu1(l) END DO
END IF END DO
IF( a(1,1) /= 0.D0 ) THEN div = 1.D0 / a(1,1) a(1,2) = a(1,2) * div a(1,3) = a(1,3) * div b(1) = b(1) * div a(1,1) = 0.D0
!li=2
amul = a(2,1)
a(2,2) = a(2,2) - amul * a(1,2) a(2,3) = a(2,3) - amul * a(1,3) b(2) = b(2) - amul * b(1) !li=3
amul = a(3,1)
a(3,2) = a(3,2) - amul * a(1,2) a(3,3) = a(3,3) - amul * a(1,3) b(3) = b(3) - amul * b(1) END IF
IF( a(2,2) /= 0.D0 ) THEN div=1.D0/a(2,2) a(2,3)=a(2,3)*div b(2)=b(2)*div a(2,2)=0.D0 !li=1
amul=a(1,2)
a(1,3)=a(1,3)-amul*a(2,3) b(1)=b(1)-amul*b(2) !li=3
amul=a(3,2)
a(3,3)=a(3,3)-amul*a(2,3) b(3)=b(3)-amul*b(2) END IF
IF( a(3,3) /= 0.D0 ) THEN div=1.D0/a(3,3) b(3)=b(3)*div a(3,3)=0.D0 !li=1
amul=a(1,3)
b(1)=b(1)-amul*b(3) !li=2
amul=a(2,3)
Parameter Dependent Code & Unrolling
per un dato set di parametri
(di input), riesco ad eliminare
ogni loop, ottimizzando cache
e pipe di esecuzione
Debugging (post mortem)
program hello_bug
real(kind=8) :: a( 10 ) call clearv( a, 10000 ) print *, SUM( a )
end program
subroutine clearv( a, n) real(kind=8) :: a( * ) integer :: n
integer :: i do i = 1, n a( n ) = 0.0 end do
gfortran –g hello_bug.f90
Remove core size limit
ulimit –c unlimited ./a.out
Segmentation fault (core dumped) gdb ./a.out core
Debugging (in vivo)
gfortran -g hello_bug.f90 gdb ./a.out
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-32.el5) Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /plx/userinternal/acv0/a.out...done.
(gdb) run
Starting program: /plx/userinternal/acv0/a.out
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000 Program received signal SIGSEGV, Segmentation fault.
0x0000000000400833 in clearv (a=0x7fffffffe2f0, n=@0x400968) at hello_bug.f90:12 12 a( n ) = 0.0
#include<sys/types.h>
#include<sys/time.h>
double cclock_() {
/* Restituisce il valore del CLOCK di sistema in secondi */
struct timeval tmp;
double sec;
gettimeofday( &tmp, (struct timezone *)0 );
sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;
return sec;
} program mat_mul
integer, parameter :: n = 100 real*8 :: a(n,n), b(n,n), c(n,n) real*8 :: t1, t2
real*8 :: cclock external cclock a = 1.0d0
b = 1.0d0 t1 = cclock()
call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n ) t2 = cclock()
write(*,*) SUM(c), t2-t1
1) gcc –c cclock.c
Link
Fortran and C
Link Fortran and C
program rand
real(kind=8) :: a external crand call crand( a )
print *,'this is random ', a end program
#include<stdlib.h>
#include<time.h>
void crand_( double * x ) {
(*x) = ( (double)random()/(double)RAND_MAX );
}
Link a C subroutine with a Fortran program
rand.f90 crand.c
Fortran passes arguments by reference, C passes them by value
Link a Fortran subroutine with a C program
#include<stdio.h>
int main() {
int n;
double a[10], d;
n = 10;
d = 1.0;
setv_( a, &d, &n );
printf("%lf\n", a[0]);
}
subroutine setv( a, d, n ) real(kind=8) :: a( * ) real(kind=8) :: d
integer :: n integer :: i do i = 1, n a( i ) = d end do
end subroutine
cvec.c vset.f90
Link Fortran and C
Make Command
If a code is large and/or it shares subroutines with other codes, it is useful to split the source in many files that could be placed in different directories.
In F90 there are dependencies among program units,
i.e. modules must be compiled before than any other program units.
Therefore there is a well defined order for compiling source files
To avoid compiling by hands the sources in the proper order,
the make command could be used
Make Command
The make command can be programmed to do the job for you using a file containing instruction and dircetive.
By default the make command looks in the present directory
for a file colled Makefile or makefile
A simple makefile
# this is a comment within the makefile myprog.x : modules.o main.o
f90 –o myprog.x modules.o main.o modules.o : modules.f90
f90 –c modules.f90 main.o : modules.o main.f90
f90 –c main.f90
this tell to the make command that myprog.x depend from modules.o and main.o
make execute the command only when modules.o and main.o have been built
to compile the code, from the console the programmer issue the command:
> make
A less simple makefile
# this is a comment within the makefile myprog.x : modules.o main.o
f90 –o myprog.x modules.o main.o main.o : modules.o
.f90.o
f90 –c $<
this is an implicit dependency, it state that all files “.o” depend and should be generated from the corresponding “.f90” files
this is a make macro, and it is expandend with the proper
“.f90” filename
In the above example, make try to built myprog.x but it realizes that main.o and modules.o should be