Optimization techniques

(1)

Optimization techniques

Carlo Cavazzoni, HPC department, CINECA

(2)

Modern node architecture

CPU

RAM Disk

cache I, D

Small & fast

(3)

Cache

Hierarchy register  L1  L2  L3  RAM L1: Instruction and data

Size: L1  …  Ln Speed: L1  …  Ln

CPU looks for data in L1, if it is there (L1 cache hit), if not (L1 cache miss) and looks in L2 …

cache miss  penaly in terms of clock cycle

(4)

CACHE Direct Mapped

32 Kbyte

32 Kbyte 32 Kbyte

0 32 K

64 K

128 K

cache

(5)

Cache set associative

32 Kbyte

32 Kbyte 16 Kbyte

32 Kbyte

0 32 K

64 K

128 K 16 Kbyte

16 K

48 K LastRecentlyUse

d

Round Robin

Random

(6)

Loop optimization

(7)

Loop fusion

Locality in time

do i=1, n

a(i) = b(i) + 1.0 enddo

do i=2, n

c(i) = sqrt(a(i-1)) enddo

do i=2, n

a(i) = b(i) + 1.0 c(i) = sqrt(a(i-1)) enddo

a(1) = b(1) + 1.0 if n is big enough, a is

loaded, offloaded and loaded again into cache

Reuse the a(i) loaded into

cache

(8)

Loop interchange

Locality in space

do i=1, n do j=1, n

a(i,j) = b(i,j) + 1.0 enddo

enddo

Load elements into cache lines and use only one before replacing them with new elements

Load elements into cache and use all of them before replacing them with new elements

do j=1, n do i=1, n

a(i,j) = b(i,j) + 1.0 enddo

enddo

0x000x01 0x02 0x03

a b

j j

i

(9)

Cache thrashing

real, dimension (1024) :: a,b COMMON /my_com/ a, b

do i=1, 1024

a(i) = b(i) + 1.0 enddo

offset  shift matrixes w.r.t. cache  no more problems

integer offset =

(linea_cache)/SIZE(REAL)

real, dimension (1024+offset) :: a,b COMMON /my_com/ a, b

do i=1, 1024

a(i) = b(i) + 1.0 enddo

Padding

size cache = 4*1024, direct mapped, a, b contiguous  cache thrashing

array size = multiple of cache size 

(10)

Loop unrolling

do j=1, n

do i=1, (n-1)

a(i,j)= b(i,j)+b(i+1,j)+1.0 enddo

enddo

Equivalent Loops.

Fewer jump.

Fewer dependencies.

Fill pipelines and vector units.

do j=1, n

do i=1, (n-1), 2

a(i,j) = b(i,j) +b(i+1,j)+1.0 a(i+1,j) = b(i+1,j)+b(i+2,j)+1.0 enddo

enddo

(11)

Optimize with

numerical libraries

Less coding

Tested and (almost) bug free Standard

Efficient implementation

Optimized

(12)

BLAS

Basic Linear Algebra Subprogram, Parallel BLAS and Basic Linear Algebra Communication Subsystem (www.netlib.org)

• Level 1 BLAS: Vector-Vector operations (scalar only).

• Level 2 BLAS, PBLAS: Vector-Matrix operations (scalar and parallel).

• Level 3 BLAS, PBLAS: Matrix-Matrix operation (scalar and parallel).

• Level 1 and 2 BLACS: vector reduction,

vector and matrics communications.

(13)

Lapack and Scalapack

Linear Algebra Package and Scalable Lapack (www.netlib.org)

 Matrix decomposition.

 Solution of Linear Systems.

 Eigenvalues and Eigenvetors

 Linear Least Square solutions

(14)

MKL ESSL ACML

CUBLAS

MAGMA

PLASMA

(15)

MASS (IBM)

• Accelerated version of SQRT, SIN, COS, EXP, LOG, ecc…

 Scalar and vector

(16)

VML

Equivalent to MASS (vector version only) For Intel processors

Accelerated version of:

sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y

(17)

VML

do i = 1, n

r = r + sin( a(i) ) end do

call vdsin( n, a, y ) do i = 1, n

r = r + y( i ) end do

CALL vml_subroutine( n, a, y )

(18)

BLAS

Matrix multiplication

DGEMM (transa, transb, l, n, m, alpha, a, lda, b, ldb, beta, c, ldc)

_lm

**real8 a(lda,), b(ldb,), c(ldc,)**

(19)

Profileing with gprof

Compiler flag “-pg” or “-p” (depend on the compiler)

gcc -pg –c mio.c ./a.out

gmon.out gprof

(20)

gcc -pg -funroll-loops –O2 dotprod.c -static

[cineca@rfxoff1 Carlo]$ ./a.out d = 1000000.000000

 % cumulative self self total

 time seconds seconds calls us/call us/call name

 68.57 0.05 0.05 2 23437.50 23437.50 set_vector

 31.43 0.07 0.02 1 21484.38 21484.38 dot_product

 0.00 0.07 0.00 1 0.00 68359.38 main

gprof

(21)

Profileing “by hand”

Find “hot spot” in your application Use temporization functions

CALL SYSTEM_CLOCK(iclk1, count_rate=nclk) CALL critical_subroutine( …… )

CALL SYSTEM_CLOCK(iclk2)

PRINT *,REAL(iclk2-iclk1)/nclk

t1 = cclock()

CALL critical_subroutine( …… ) t2 = cclock()

PRINT *, (t2-t1) CALL CPU_TIME( t3 )

CALL critical_subroutine( …… ) CALL CPU_TIME( t4 )

PRINT *, (t4-t3)

(22)

Mesure performances

#include<stdio.h>

#include<time.h>

#include<ctype.h>

#include<sys/types.h>

#include<sys/time.h>

double cclock_() {

/* Restituisce il valore del CLOCK di sistema in secondi */

struct timeval tmp;

double sec;

gettimeofday( &tmp, (struct timezone *)0 );

sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;

return sec;

}

(23)

PROGRAM test_dgemm IMPLICIT NONE

INTEGER, PARAMETER :: dim = 1000

REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:) INTEGER :: i,j,k

REAL*8 :: t1, t2 REAL*8 :: cclock EXTERNAL :: cclock

ALLOCATE( x( dim, dim ), y( dim, dim ) ) ALLOCATE( z( dim, dim ) )

y = 1.0d0

z = 1.0d0 / DBLE( dim ) x = 0.0d0

t1 = cclock( ) do j = 1, dim do i = 1, dim do k = 1, dim

x(i,j) = x(i,j) + y(i,k) * z(k,j) end do

end do end do

t2 = cclock()

write(*,*) ' Matrix sum = ', sum(x) write(*,*) ' tempo (secondi) ', t2-t1

PROGRAM test_dgemm IMPLICIT NONE

INTEGER, PARAMETER :: dim = 1000

REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:) INTEGER :: i,j,k

REAL*8 :: t1, t2 REAL*8 :: cclock EXTERNAL :: cclock

ALLOCATE( x( dim, dim ), y( dim, dim ), z( dim, dim ) ) y = 1.0d0

z = 1.0d0 / DBLE( dim ) x = 0.0d0

t1 = cclock()

! x = matmul( y, z )

call dgemm('N', 'N', dim, dim, dim, 1.0d0, y, c dim, z, dim,0.0d0, x, dim)

t2 = cclock()

write(*,*) ' Matrix sum = ', sum(x) write(*,*) ' tempo (secondi) ', t2-t1

(24)

ATLAS

BLAS compatible

Automatically Tuned Linear Algebra Software http://sourceforge.net/

http://math-atlas.sourceforge.net/devel/

(25)

http://www.fftw.org Fast Fourier Trasform

FFT complex to complex FFT complex to real

(26)

Advanced techniques

(27)

do i=1,n do j=1,m

y(j,i) = x(i,j) enddo

enddo

Case Study: matrix transposition

y

x

(28)

Suppose 2-way set associative

y x

For each value of x I need to load into cache a whole line.

1) X “allocate” the 2nd way.

2) Risk of thrashing

3) When the cache is full, the proc. Start to overwrite cache lines

data mapped in cache y “allocate” the 1st way

What happens with the cache

(29)

y  x

As before for each value of x I need to load into cache a whole cache line.

data mapped in cache y “allocate” the 1st way

What happens with the cache, cont.

(30)

y x

Block Algorithm

Load a block of data into cache

Swap data in cache Write data back to memory

(31)

do i=1,n do j=1,m

y(j,i) = x(i,j) enddo

enddo

do ib = 1, nb

ioff = (ib-1) * bsiz do jb = 1, mb

joff = (jb-1) * bsiz do j = 1, bsiz

do i = 1, bsiz

buf(i,j) = x(i+ioff, j+joff) enddo

enddo

do j = 1, bsiz do i = 1, j-1 bswp = buf(i,j)

buf(i,j) = buf(j,i) buf(j,i) = bswp

enddo enddo

do i=1,bsiz do j=1,bsiz

y(j+joff, i+ioff) = buf(j,i) enddo

enddo enddo bsiz = block size

nb = n / bsiz mb = m / bsiz You need to handle:

MOD(n / bsiz) /= 0 OR MOD(m / bsiz) /= 0

Solution: Block algorithm

(32)

Whole block transpose

do ib = 1, nb

ioff = (ib-1) * bsiz do jb = 1, mb

joff = (jb-1) * bsiz do j = 1, bsiz

do i = 1, bsiz

buf(i,j) = x(i+ioff,j+joff) enddo

enddo

do j = 1, bsiz do i = 1, j-1 bswp = buf(i,j) buf(i,j) = buf(j,i) buf(j,i) = bswp enddo

enddo

do i=1,bsiz do j=1,bsiz

y(j+joff,i+ioff) = buf(j,i) enddo

enddo enddo

enddo

IF( min( 1, MOD(n,bsiz) ) .GT. 0 ) THEN ioff = nb * bsiz

do jb = 1, mb

joff = (jb-1) * bsiz do j = 1, bsiz

do i = 1, MIN(bsiz, n-ioff) buf(i,j) = x(i+ioff, j+joff) enddo

enddo

do i = 1, MIN(bsiz, n-ioff) do j = 1, bsiz

y(j+joff,i+ioff) = buf(i,j) enddo

enddo enddo

END IF

IF( MIN(1, MOD(m, bsiz)) .GT. 0 ) THEN joff = mb * bsiz

do ib = 1, nb

ioff = (ib-1) * bsiz

do j = 1, MIN(bsiz, m-joff) do i = 1, bsiz

buf(i,j) = x(i+ioff, j+joff) enddo

enddo

do i = 1, bsiz

do j = 1, MIN(bsiz, m-joff) y(j+joff,i+ioff) = buf(i,j) enddo

enddo enddo END IF

IF( MIN(1,MOD(n,bsiz)).GT.0 .AND. &

& MIN(1,MOD(m,bsiz)).GT.0 ) THEN joff = mb * bsiz

ioff = nb * bsiz

do j = 1, MIN(bsiz, m-joff) do i = 1, MIN(bsiz, n-ioff) buf(i,j) = x(i+ioff, j+joff) enddo

enddo

do i = 1, MIN(bsiz, n-ioff) do j = 1, MIN(bsiz, m-joff) y(j+joff,i+ioff) = buf(i,j) enddo

enddo

1

2

3

(33)

Performance tuning and analysis: user codes

Matrix Trasposition Matrix size: 2048x2048

0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50

0 20 40 60 80 100 120

execution time

Straightforward implementation

Block

implementation

(34)

DO l=1,nphase

IF(au1(l,l) /= 0.D0) THEN lp1=l+1

div=1.D0/au1(l,l) DO lj=lp1,nphase

au1(l,lj)=au1(l,lj)*div END DO

bu1(l)=bu1(l)*div au1(l,l)=0.D0 DO li=1,nphase amul=au1(li,l) DO lj=lp1,nphase

au1(li,lj)=au1(li,lj)-amul*au1(l,lj) END DO

bu1(li)=bu1(li)-amul*bu1(l) END DO

END IF END DO

IF( a(1,1) /= 0.D0 ) THEN div = 1.D0 / a(1,1) a(1,2) = a(1,2) * div a(1,3) = a(1,3) * div b(1) = b(1) * div a(1,1) = 0.D0

!li=2

amul = a(2,1)

a(2,2) = a(2,2) - amul * a(1,2) a(2,3) = a(2,3) - amul * a(1,3) b(2) = b(2) - amul * b(1) !li=3

amul = a(3,1)

a(3,2) = a(3,2) - amul * a(1,2) a(3,3) = a(3,3) - amul * a(1,3) b(3) = b(3) - amul * b(1) END IF

IF( a(2,2) /= 0.D0 ) THEN div=1.D0/a(2,2) a(2,3)=a(2,3)*div b(2)=b(2)*div a(2,2)=0.D0 !li=1

amul=a(1,2)

a(1,3)=a(1,3)-amul*a(2,3) b(1)=b(1)-amul*b(2) !li=3

amul=a(3,2)

a(3,3)=a(3,3)-amul*a(2,3) b(3)=b(3)-amul*b(2) END IF

IF( a(3,3) /= 0.D0 ) THEN div=1.D0/a(3,3) b(3)=b(3)*div a(3,3)=0.D0 !li=1

amul=a(1,3)

b(1)=b(1)-amul*b(3) !li=2

amul=a(2,3)

Parameter Dependent Code & Unrolling

per un dato set di parametri

(di input), riesco ad eliminare

ogni loop, ottimizzando cache

e pipe di esecuzione

(35)

Debugging (post mortem)

program hello_bug

real(kind=8) :: a( 10 ) call clearv( a, 10000 ) print *, SUM( a )

end program

subroutine clearv( a, n) real(kind=8) :: a( * ) integer :: n

integer :: i do i = 1, n a( n ) = 0.0 end do

gfortran –g hello_bug.f90

Remove core size limit

ulimit –c unlimited ./a.out

Segmentation fault (core dumped) gdb ./a.out core

(36)

Debugging (in vivo)

gfortran -g hello_bug.f90 gdb ./a.out

License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software: you are free to change and redistribute it.

There is NO WARRANTY, to the extent permitted by law. Type "show copying"

and "show warranty" for details.

This GDB was configured as "x86_64-redhat-linux-gnu".

For bug reporting instructions, please see:

<http://www.gnu.org/software/gdb/bugs/>...

Reading symbols from /plx/userinternal/acv0/a.out...done.

(gdb) run

Starting program: /plx/userinternal/acv0/a.out

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000 Program received signal SIGSEGV, Segmentation fault.

0x0000000000400833 in clearv (a=0x7fffffffe2f0, n=@0x400968) at hello_bug.f90:12 12 a( n ) = 0.0

(37)

#include<sys/types.h>

#include<sys/time.h>

double cclock_() {

/* Restituisce il valore del CLOCK di sistema in secondi */

struct timeval tmp;

double sec;

gettimeofday( &tmp, (struct timezone *)0 );

sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;

return sec;

} program mat_mul

integer, parameter :: n = 100 real*8 :: a(n,n), b(n,n), c(n,n) real*8 :: t1, t2

real*8 :: cclock external cclock a = 1.0d0

b = 1.0d0 t1 = cclock()

call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n ) t2 = cclock()

write(*,*) SUM(c), t2-t1

1) gcc –c cclock.c

Link

Fortran and C

(38)

Link Fortran and C

program rand

real(kind=8) :: a external crand call crand( a )

print *,'this is random ', a end program

#include<stdlib.h>

#include<time.h>

void crand_( double * x ) {

(*x) = ( (double)random()/(double)RAND_MAX );

}

Link a C subroutine with a Fortran program

rand.f90 crand.c

Fortran passes arguments by reference, C passes them by value

(39)

Link a Fortran subroutine with a C program

#include<stdio.h>

int main() {

int n;

double a[10], d;

n = 10;

d = 1.0;

setv_( a, &d, &n );

printf("%lf\n", a[0]);

}

subroutine setv( a, d, n ) real(kind=8) :: a( * ) real(kind=8) :: d

integer :: n integer :: i do i = 1, n a( i ) = d end do

end subroutine

cvec.c vset.f90

Link Fortran and C

(40)

Make Command

If a code is large and/or it shares subroutines with other codes, it is useful to split the source in many files that could be placed in different directories.

In F90 there are dependencies among program units,

i.e. modules must be compiled before than any other program units.

Therefore there is a well defined order for compiling source files

To avoid compiling by hands the sources in the proper order,

the make command could be used

(41)

Make Command

The make command can be programmed to do the job for you using a file containing instruction and dircetive.

By default the make command looks in the present directory

for a file colled Makefile or makefile

(42)

A simple makefile

# this is a comment within the makefile myprog.x : modules.o main.o

f90 –o myprog.x modules.o main.o modules.o : modules.f90

f90 –c modules.f90 main.o : modules.o main.f90

f90 –c main.f90

this tell to the make command that myprog.x depend from modules.o and main.o

make execute the command only when modules.o and main.o have been built

to compile the code, from the console the programmer issue the command:

> make

(43)

A less simple makefile

# this is a comment within the makefile myprog.x : modules.o main.o

f90 –o myprog.x modules.o main.o main.o : modules.o

.f90.o

f90 –c $<

this is an implicit dependency, it state that all files “.o” depend and should be generated from the corresponding “.f90” files

this is a make macro, and it is expandend with the proper

“.f90” filename

In the above example, make try to built myprog.x but it realizes that main.o and modules.o should be