Centre for Advanced Studies,
Research and Development in Sardinia
Uta - (CA)

ENEA-MURST Programme
Objective 7

Implementation of the Ares code on parallel architectures

by:
S. Chibbaro and M. Mulas
Computational Fluid Dynamics Area
This report describes the MPI parallelization of the combustion code Ares. First, the structure of the code is described; then a detailed analysis of the algorithms used to integrate the flow equations is carried out, with particular emphasis on the parallelization issues. An analysis of a few possible alternative approaches is then followed by the description of the chosen approach. A preliminary validation of the parallel code completes the report. Thorough validation on complex 3D test cases will be presented in the next report.
Contents

1 Structure of the code
  1.1 The Multi-Block Structure
  1.2 Data allocation
  1.3 Boundary conditions
  1.4 Time Integration
2 Code description
  2.1 Initialization
  2.2 The Solver
  2.3 Integration Module
  2.4 fsvel subroutine
  2.5 fscg subroutine
      2.5.1 CGS solver
3 Analysis of parallelism
  3.1 Approach #1
  3.2 Approach #2
  3.3 Approach #3
4 Code Parallelization
  4.1 Initialization
  4.2 Solver
  4.3 Convergence
5 Validation
A Input and Output files
  A.1 The dat file
  A.2 The bc file
  A.3 The msh file
  A.4 The con file
  A.5 Restart simulations
  A.6 The Output Files
References
1 Structure of the code

The Ares code belongs to a family of codes developed in past years by the Fluid Dynamics group of CRS4 (see references [1] and [2] for instance). It can be divided into three main parts.

INIT: all variables are initialized and data are allocated.

FSSOLVER: residuals are computed and all equations are advanced in time with an implicit fractional step method.

OUTPUT: all the output fields are written.

This report begins with the description of the global structure of the code. Then the multi-block structure, the boundary condition codification method and the data allocation are illustrated.
1.1 The Multi-Block Structure

In cell-centered Finite-Volume codes the boundary condition treatment is done by means of a layer of ghost cells surrounding the computational domain. In case of multi-block meshes, time-advancing iterations should be performed on each block independently and, eventually, in parallel, as shown in figure 1:
      do ibloc = 1, nbloc
c        ... computation on block ibloc ...
      end do

Figure 1: The block loop structure
Two layers of ghost cells are then used to store both the geometrical definition (volumes and surface normals) and field information (values of the dependent variables).

Figure 2 shows how the domain splitting process consists of constructing the appropriate ghost cells for each sub-block, which in turn means duplicating the information on both sides of the interface cut.

[Figure 2: Domain splitting. The interior domain is split and each sub-block is surrounded by 2 layers of ghost cells.]
If NI, NJ and NK represent the number of grid nodes of a structured mesh, then all of the field arrays within a given block are dimensioned in the following way:

      ARRAY(-1:NI+1, -1:NJ+1, -1:NK+1)

no matter whether they represent cell-center values (such as volumes, dependent variables, time-steps, ...), node values (such as grid point coordinates), or values of physical and geometrical entities located at the cells' interfaces (such as, for instance, surface normals and fluxes). Figure 3 shows, for the sake of simplicity, a 2D grid made of NI × NJ nodes. Internal cells range from index (1,1) (lower-left corner) to index (NI−1, NJ−1) (upper-right corner), while node-centered entity arrays range from index (1,1) (lower-left corner) to index (NI, NJ) (upper-right corner).
[Figure 3: Grid nomenclature. Grid points range from (0,0) to (ni,nj) including the dummy layer; cell centers range from (−1,−1) to (ni+1,nj+1), the internal ones from (1,1) to (ni−1,nj−1).]
Figure 4 shows how the different block faces are identified for the application of boundary conditions. For generality and flexibility purposes, each face can be subdivided in various portions (segments), as shown in fig. 5, each segment corresponding to a different boundary treatment.

[Figure 4: Face nomenclature. Faces 1 to 6 of a block with respect to the i, j, k directions.]

[Figure 5: Face splitting]
1.2 Data allocation

A one-dimensional double precision array work is used to store all data: as shown in figure 6 the first part of work is reserved for permanent data, whereas a second part is used to dynamically store temporary data.
[Figure 6: The work array structure. Locations 1 to NLAST hold the permanent storage of blocks 1 to NBLOC; locations NLAST+1 to NMAX are managed as a stack for temporary storage.]
The work array is dimensioned in the main program as:

      DOUBLE PRECISION WORK(NMAX)

where NMAX is a parameter which can be modified by the user depending on the size of the test case. Figure 6 shows the array structure: memory is allocated for all permanent data of all blocks (the total number is NBLOC). NLAST represents the last array location allocated for permanent data. The space left for temporary storage is then made of (NMAX − NLAST) locations.

The temporary storage is managed as a stack with two utility routines which act as the C functions push and pop.
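As a minimal sketch of how such a stack allocator on the work array can work (the routine and variable names ipush, ipop and /stck/ are hypothetical, not the actual Ares ones; itop must be initialized to NLAST):

      subroutine ipush(n, iptr)
c     reserve n locations on top of the stack and return their start
      integer n, iptr
      integer itop, nlast, nmax
      common /stck/ itop, nlast, nmax
      if (itop + n .gt. nmax) stop 'work array exhausted'
      iptr = itop + 1
      itop = itop + n
      end

      subroutine ipop(n)
c     release the n locations that were allocated last
      integer n
      integer itop, nlast, nmax
      common /stck/ itop, nlast, nmax
      itop = itop - n
      end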
Each block area, shown in figure 7, is organized in the same way. The data can be accessed by means of arrays of pointers, all of them dimensioned as (NBLOC_MAX),

[Figure 7: Details of the work array structure. Each block area stores in sequence: the dependent variables (p, u, v, w, f1...fn, k, ε), pointed to by idepvp(ibloc); the old dependent variables, idepop(ibloc); the distance from the closest surface (Spalart-Allmaras), idisp(ibloc); the density (ρ), irrr(ibloc); the turbulent viscosity, imup(ibloc); the diagonal pressure correction, idiap(ibloc), and the old one, iodiap(ibloc); the volumes, ivolp(ibloc); the coordinates (x, y, z), icoop(ibloc); the face normals (NIx, NIy, NIz, NJx, NJy, NJz, NKx, NKy, NKz), inorp(ibloc); the cell reference lengths, illp(ibloc); the time steps, idtp(ibloc); plus ncjet, ivcomp and the injet boundary field for jet cells.]
where NBLOC_MAX is a user-defined parameter which identifies the maximum number of blocks allowed. All pointer arrays are stored in the COMMON block, together with the following ones:

      NNI(NBLOC_MAX)
      NNJ(NBLOC_MAX)
      NNK(NBLOC_MAX)
      NCELL(NBLOC_MAX)

which store the dimensions of the mesh (NI, NJ and NK) and the total number of cells (internal plus two layers of ghost cells) of the given block, NCELL = (NI+3)·(NJ+3)·(NK+3).
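For instance, a block of 51 × 51 × 51 grid nodes contains 50³ = 125000 internal cells, while its arrays span NCELL = 54 · 54 · 54 = 157464 locations once the two ghost layers per side are included.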
Table 1 summarizes all the defined arrays of pointers: for each one a very brief description is given together with its size, in the general case of turbulent combustion with four scalars; the size is to be intended in units of NCELL.

The storage of old dependent variables is needed for the time advancing solution method.

Name     description                 size
IDEPVP   dependent variables         10
IDEPOP   old dependent variables     10
IDISP    distance from solid walls   1
IRRR     density                     1
IMUP     turbulent viscosity         1
IDIAP    pressure correction         1
IODIAP   old pressure correction     1
IVOLP    volumes                     1
ICOOP    coordinates                 3
INORP    normals                     9
ILLP     cell reference length       3
IDTP     time step                   1

Table 1: List of arrays of pointers.
Besides the pointers to permanent memory locations, six further pointers are stored in the COMMON block. They are used as temporary, and so popped and pushed whenever needed. Generally such arrays are not statically stored in COMMON but declared in the subroutines where they are used. These six arrays are stored in COMMON because they are used in the solver, at each time step, in many subroutines, so it is useful to have them statically stored; the effect is that, inside the solver only, they are treated as permanent pointers. They are imat, irhs, ixcg, which are used in solving a linear system, and ilprec, idprec, iuprec, which are used to precondition the matrices of the linear system, whenever a preconditioner is used.
1.3 Boundary conditions

In the ARES code a number of choices are available to impose physical boundary conditions.

In case of multi-block meshes where different blocks are connected, it is necessary to exchange data between them. Two layers of ghost cells surrounding the block computational domain are then used to store field quantities.

The data exchange is schematically shown in figure 8. After the data exchange each block contains both its internal and the external data, in such a way that the fluxes are computed as if the block subdivision didn't exist.

[Figure 8: Data exchange. The two layers of ghost cells of each block are filled with the first two layers of internal cells of the adjacent block.]
The meshes used are structured and conformal. This implies that all the points are defined by three indices I, J, K. In subroutine readbc the parameters used for setting the boundary conditions are read. First, 5 indices are read for all the sides of each block:

      itype, ist1, ien1, ist2, ien2.

They are 5 integers which represent the type of BC to be applied (inflow, outflow, connection) and the values of the first and last nodes of the two indices which describe the side considered. For example, for a side defined by the equation I = 1, the indices J and K can change, and the parameters could be, for instance, ist1 = 1, ien1 = nj, ist2 = 1, ien2 = nk. An important feature of the ARES code is the generality of the orientation of each block with respect to the others; the only limitation is that only structured and conformal grids are permitted.

In case the itype parameter identifies a block connection, a second line of parameters is also read. These are:

      icsi, ics1, inc1, ics2, inc2.

They are 5 integers which indicate the side connected, the values of the indices at its first node and their increments (+1 or −1).

All these parameters are stored in integer pointers in the COMMON block.
In subroutine dumcoo a layer of ghost nodes is built up, so that two layers of ghost cells are formed. In block connections the value of the mesh at each node is filled with an exchange of data as shown in figure 8. After the ghost cell construction, all the indices of the boundary conditions are stored with new values: they are now written for cell-centered boundaries and include the ghost cells.

In the connections among blocks it is essential to load and unload the arrays to be exchanged in the right order, because of the generality of the orientations of each block.

This is possible knowing the integer parameters read in readbc. These parameters describe how the side iside of the block ibloc is connected with the side icsi of the block icbl. A simple algorithm also indicates the orientation of the connected block and permits loading the array to be exchanged in the right order; it is:
      inc3 = 1
      iss  = icsi + iside
      iss2 = (iss/2)*2
      if (iss2 .eq. iss) inc3 = -1
      inc  = max(inc1*inc2*inc3, 0)                    (1)
This procedure fixes the value of inc to be 0 or 1. If it is 0, the orientation of the blocks is such that the order of the indices must be inverted with respect to the natural order.
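As a worked example (the index values are purely illustrative): for iside = 1 and icsi = 3, iss = 4 is even, so inc3 = −1; with increments inc1 = inc2 = 1 the product inc1·inc2·inc3 = −1 and inc = max(−1, 0) = 0, so the exchanged array must be loaded with the index order inverted.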
1.4 Time Integration

Equations for velocity, pressure and eventually scalars and turbulent fields are advanced in time by using an implicit integration method. The general form of the discretized system of partial differential equations to be solved can be written as:

      ΔQ/Δt = Res^{n+1}                                (2)

By using an implicit scheme, the right hand side of system 2 at the temporal step n+1 can be obtained by its linearization:

      Res^{n+1} = Res^n + (∂Res/∂Q)^n ΔQ               (3)
so that, in order to advance the field from time n to the next time n+1, equation 2 can be rewritten as:

      ΔQ = Res^n Δt + (∂Res/∂Q)^n ΔQ Δt                (4)

By defining the Jacobian matrix J^n as J^n = (∂Res/∂Q)^n, equation 4 becomes:

      (I/Δt − J^n) ΔQ = Res^n                          (5)
which has the general matrix form:

      M x = rhs                                        (6)

where M is the matrix of the coefficients and x the unknown vector.

The right hand side of system 6 contains the cell residuals of the variables to be solved. The form of these fluxes depends on the spatial discretization scheme. With an implicit time advancing method it is necessary to solve the linear system 6 at each time step. In the ARES code two algorithms are available: CGS and BI-CGSTAB. They are two standard algorithms implemented in the code via public domain library templates. They solve the linear system 6 through an iterative procedure. They have the same structure, and in the following analysis both of them will be called simply CGS solver for the sake of simplicity. If requested, a preconditioner is joined to the CGS solver.

The equations of motion, the compressible Navier-Stokes equations, are solved with a fractional step method. In this method, first a velocity field is computed (one that does not necessarily respect the steady continuity equation); then the correct pressure field is computed; finally, using the value of the pressure field, the velocity is corrected to enforce the continuity constraint.
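A schematic sketch of one time step of this fractional step scheme, using the subroutine names that will be introduced in section 2.2 (the calls are shown without their argument lists):

      call fsvel
c        provisional velocity field, not necessarily divergence-free
      call fspressure
c        pressure field and velocity correction (projection)
      call fskeps
c        turbulence equations (if the calculation is turbulent)
      call fsscal
c        scalar equations (if the calculation is reacting)
      call fsdensity
c        density from the equation of state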
2 Code description

2.1 Initialization

In the main program, init is called.

[Figure 9: The init structure. Read the input files; call readbc and store the B.C. information; call alloc and set up the data structure; read the mesh and set up the geometry; calculate the distance from the wall (SA); open the output files.]

In this subroutine all the initialization operations are made, following the scheme of figure 9.
All the input files are read.

The subroutine readbc is called; here all the information about boundary conditions is read and stored.

The subroutine alloc is called, where the data structure described in paragraph 1.2 is set up.

The layer of ghost nodes is built up. All the geometrical cell variables (volumes and normals) are calculated. In the case of a turbulent simulation with the Spalart-Allmaras model, the distance of each cell from the nearest wall is computed.

Finally, all the output files are opened.
2.2 The Solver

Once the initialization is finished, the subroutine fssolver is called from the main program. The solver is the core of the whole program and represents the time integration module.

[Figure 10: The solver structure. Time iteration loop: set BC + TURB (setbc, fsmutur, turbsbc); store old values; compute the time step (fststep); advance the variables in time (fsvel, fspressure, fskeps, fsscal, fsdensity, fsmutur); set BC + TURB again; convergence check.]
With reference to figure 10, at each iteration step the following is performed:

1. The subroutines setbc and sabc or turbsbc are called, respectively if the Spalart-Allmaras or the k-ε turbulence model is used; so boundary conditions are applied for all the dependent variables. In turbulent flows the subroutine fsmutur is called to compute the turbulent viscosity.

2. The values of the dependent variables and of the pressure correction at the previous time step are stored; the dependent variables, as shown in paragraph 1.2, are P, that is the pressure; U, V, W, which are the three velocity components; f_i, which are the scalar fields; and the turbulence variables.

3. The time step is computed. It is calculated on a local cell basis, once the global CFL number is provided as an input (cfl = a·Δt/Δx ≤ 1); see the sketch after this list.

4. All the dependent variable fields are advanced in time. The subroutine fsvel is called, which advances the velocity. The subroutine fspressure is called to advance the pressure and correct the velocity. The subroutine fskeps is called to advance the turbulent field following the chosen turbulence model. The subroutine fsscal is called to compute the scalar fields. The subroutine fsdensity is also called: the density is calculated through the equation of state. The subroutine fsmutur is called, if the calculation is turbulent, to calculate the new value of the turbulent viscosity field.

5. Boundary conditions are applied to all dependent fields again.

6. A check on the convergence level is made in order to test whether a prefixed value has been reached.
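As a sketch of the local time step computation of point 3 (the array and variable names are illustrative, not the actual Ares ones; a is the local wave speed and dx the cell reference length):

c     Local time step from the global CFL number (illustrative sketch):
c     dt is limited cell by cell so that cfl = a*dt/dx <= 1
      do k = 1, nk-1
         do j = 1, nj-1
            do i = 1, ni-1
               dt(i,j,k) = cfl * dx(i,j,k) / a(i,j,k)
            end do
         end do
      end do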
Block loops are performed inside each module of routine fssolver. From the point of view of ease of parallelization, the fact of having a loop over the blocks inside each one of the previously enumerated modules is quite appealing, because it fits the natural block-to-processor approach.

All the modules but the fourth one perform their computation independently on each block and are intrinsically parallel. Communications among the blocks occur where boundary conditions are imposed and in module 4. A more detailed description of what is performed at point 4 is presented in the next paragraph for clarification.
2.3 Integration Module

In the above paragraph it has been shown (figure 10) that, in order to advance the variable fields in time, five subroutines named fs"variable" are called from the subroutine fssolver. It has been noted that the time marching algorithm is an implicit one and that it involves the construction and the solution of an algebraic linear system. This is done in these subroutines.

These subroutines all have the same structure, which is shown in figure 11. So only the first called, that is fsvel, will be analyzed, it being implicit that the same analysis holds for all the others.

[Figure 11: Structure of the generic fs"variable" subroutine. Allocate the system matrix; build up the linear system (loop over blocks); solve the linear system; advance the fields at the next time step (loop over blocks); deallocate the system matrix. The second module is described in fig. 13.]
Some comments must be made on figure 11. To advance the variable fields in time, an algebraic linear system of the form 6 has to be solved. At the beginning of all of these subroutines fs"variable" the matrix M, here called A, the unknown x and the right hand side rhs are allocated in work as temporary arrays: they are pushed. As noted in paragraph 1.2, the pointers which indicate the memory locations of these arrays are imat for A, ixcg for x and irhs for rhs. They are deallocated only at the end of the subroutine (figure 11), so they are treated as permanent arrays during all the calculation inside each subroutine fs"variable". The difference with respect to a true permanent array, for example the one indicated by the pointer idepvp, is that each variable needs a different pointer for a permanent array, while the same pointers imat, ixcg, irhs are reused in the solution of the linear system for each variable, that is in all fs"variable" subroutines. At this level the size of these arrays is known. The dimensions of the matrices A and x are:

      A = 7 · (NI+3)(NJ+3)(NK+3) · nvar
      x = (NI−1)(NJ−1)(NK−1) · nvar                    (7)

and rhs has the same dimensions as x. Here (NI−1), (NJ−1), (NK−1) give the total number of internal cells in the whole computational domain, all blocks included; nvar is the number of variables calculated and updated in the fs"variable" subroutine: for example, nvar is 3 in fsvel for the velocity and 1 in fspressure for the pressure. The matrix corresponding to the pointer imat is not called M but A because of its dimensions. In the linear system 6 the matrix M must have a number of columns equal to the dimension of the unknown x. Instead the memory allocation localized by the pointer imat for the matrix A is larger, for both internal and ghost cells are placed in the columns.
[Figure 12: Matrices of the linear system. The matrix M (work(imat)), the unknown X (work(ixcg)) and the RHS (work(irhs)) are stored as sequences of per-block sections B1, B2, ..., BN.]
Special attention has to be paid to how the matrix A, the unknown vector x and the rhs of system 6 are stored. The criterion used to store these arrays is explained in figure 12.

The matrices are evaluated inside a loop over blocks. At the end of the loop all the matrices, even if computed independently, are stored all together: all the data concerning the linear system are stored in work. The fact that these arrays are constructed inside a loop over blocks is reflected by their form; as shown in figure 12, they are "block-matrices".

At this level there is no communication among the blocks. At this point the overall system 6 has to be solved, where now A is a block matrix.
In figure 11 it has been underlined which parts of the subroutines fs"variable" are inside a loop over blocks. They concern the construction of the linear system and the updating of the variable fields, which takes place after the solution of the linear system. This is important because the calculation inside a loop over blocks is distributed among the blocks and is computed independently by each one, in total agreement with the one-block-one-processor approach. The module concerning the solution of the linear system is outside the loop. This means that not all the computation is performed independently by each block.

In the next paragraph the different modules of figure 11 will be "opened" for the analysis of the fsvel subroutine, as an example for all the fs"variable" subroutines, in order to better understand the ARES structure and particularly its parallelization properties.
2.4 fsvel subroutine

With reference to figure 11, the first module contains the array allocation.

In the second module the linear system is built up. The flow chart of this module is shown in figure 13. It has been put in evidence in the figure that the whole module is inside a loop over the blocks.

[Figure 13: Implicit solver structure, first loop over the blocks. Compute the inviscid part of the Jacobian matrix (fsjacinv); add the Jacobian for the viscous fluxes (fsjaclap, fsjaclaprng); add the unsteady term; save the pressure-velocity gradient correlation for the projection method (dashed block, momentum equation only); compute the inviscid and viscous fluxes (fsflxinv, fsflxvis); build up the system RHS and compute the residual norm.]

With reference to equation 6, it is necessary to calculate the matrix M and the right hand side. First the matrix M is calculated. This is done in three parts, as shown in figure 13. The Jacobian matrix is calculated sequentially for the inviscid terms (routine fsjacinv), the viscous terms (routine fsjaclap) and the unsteady term. These terms are stored in the array A (the pointer imat).

It is remarkable that this structure is shared by all the fs"variable" subroutines. The dashed logic block in figure 13 is executed for the momentum equation only (fsvel). Here the pressure-velocity correlation is saved for future use by the pressure correction scheme.

Once the construction of matrix A is finished, the right hand side rhs is calculated. It contains the cell residuals of the variables to be solved. As in the case of matrix A, first the inviscid terms are computed (subroutine fsflxinv), then the viscous ones (subroutine fsflxvis); the computation of the fluxes makes use of the ghost cells. The operations are made with the help of a temporary array work(iflx). This array is pushed with the right dimensions:

      (NI+3)(NJ+3)(NK+3) · nvar

When all the operations are done, the internal cells of work(iflx) are copied into the array rhs and the temporary array work(iflx) is popped. This is done in the last module of figure 13.
In this module the maximum residual value is finally calculated for monitoring the convergence. At this point the linear system is built up and the loop over blocks ends.

The linear system is solved in the second part of the implicit solver, the third module of figure 11. The flow chart is represented in figure 14.

It has to be noted from figure 14 that the old values of the variables to be computed are first saved into a temporary array (subroutine fsf2t). This is done because their permanent allocation in the overall working array work is overwritten inside the linear solver (see section 2.5.1); they are restored afterwards (subroutine fst2f).

[Figure 14: Implicit solver structure, call to the linear solver. Save the generic field into a temporary array (fsf2t); call the linear solver (fscg); restore the temporary array into the permanent one (fst2f).]
2.5 fscg subroutine

The structure of subroutine fscg is reported in figure 15. In the ARES code a conjugate gradient squared algorithm is used for solving the algebraic linear system. This method may require a preconditioning of system 6. The preconditioning is carried out by multiplying the vector x by a compatible matrix, composed of a choice of A coefficients. The matrices necessary for the preconditioning are allocated as temporary arrays in work. There are four choices:

1. no preconditioning;
2. diagonal preconditioning; the pointer idprec is allocated;
3. ILU preconditioning; the pointers ilprec and iuprec are allocated;
4. SOR preconditioning; the pointer idprec is allocated.

In the memory area localized by these pointers the coefficients needed for the preconditioning are calculated and stored. The subroutine fspdiag is called for a diagonal preconditioning, the subroutine fspilud for ILU preconditioning, and the subroutine fspdiag again for SOR preconditioning.

Then the CGS solver is called. There are two available subroutines: cgs and bi-cgstab. The system is preconditioned and solved in the unknown x with an iterative procedure.

[Figure 15: fscg structure. Allocation of the preconditioning matrices (fspdiag for diagonal or SOR, fspilud for ILU); selection of the CGS solver (cgs or bi-cgstab); convergence check; deallocation of the preconditioning matrices.]
A check on the convergence of the solver iterative procedure is made at the end; if the requested tolerance has not been reached, the array x is set to zero.

In the last module of subroutine fscg the temporary arrays allocated for the preconditioning are deallocated.

It has been noted that the two linear system solvers have the same structure, so a generic CGS solver is described. The observations are the same in both cases.
2.5.1 CGS solver

A schematic flow chart of the module "CGS solver" is given in figure 16.

[Figure 16: CGS solver structure. Copy the RHS array into a working array (TEMP(i) = RHS(i), i = 1, ..., NTOT, the number of RHS elements); compute a global norm (Σ_{i=1}^{NTOT} RHS(i)²)^{1/2} to check for convergence; compute the global constants needed by the algorithm; then, at each CGS iteration and for each block (do ibloc = 1, NBLOC): set BC's and apply the preconditioning (fspsolve), perform a matrix times vector product (fsmatvec) to invert the system, update and correct the solution vector x, check convergence and return when satisfied.]

With reference to figure 16, first the right hand side array content is copied into a temporary array, and an overall norm is computed. Both these operations are made over all the (1:ni−1, 1:nj−1, 1:nk−1, nvar, nbloc) elements of the RHS vector. It is important to note that while the first step (copying) can easily be parallelized without any need of communication among processors, the second one requires message passing instructions.

In the third module of figure 16, all the scalar parameters needed by the algorithm are computed. The basic vector operations of this kind of library are totally local, that is of the form:

      A(i) = α·B(i) + β·C(i)

where α and β are parameters fixed by the algorithm and A, B, C are arrays. So they are made independently cell by cell, and are therefore intrinsically parallel.
Once all these preliminary operations are made, the linear system can be inverted. First the system is preconditioned, if requested: the subroutine fspsolve is called. In figure 16 the first logical block joined to fspsolve is dashed because not all the preconditioners need the setting of boundary conditions.

As shown in equation 7, the dimension of the columns of the matrix A is not the same as that of the vector x. In order to resolve this difficulty the vector x is copied into the area of work memory dedicated to the corresponding dependent variable field; for example, if the CGS solver has been called to advance the velocity, the vector x is copied into the work area localized by the pointer idepvp, starting from idepvp + NCELL and for the next 3·NCELL locations, that is where the velocity field has its permanent storage (figure 7). The resulting vector has a number of points equal to the points in the columns of A, the dependent variables being defined both in the internal and in the ghost cells. Inside the subroutine fspsolve these points are reached by means of a local pointer named icopy. It is simply:

      icopy = idepvp(ibloc) + ifld·ncell(ibloc)

where ifld is the parameter that indicates the field considered; in the example its value is 1. The matrix operations are made with this array work(icopy), and at the end the updated values are copied back into x. The ghost cells are not used in the diagonal and ILU kinds of preconditioning.
In the case of diagonal and ILU preconditioning the structure is explained in figure 17.

[Figure 17: fspsolve structure in the case of diagonal or ILU preconditioning. Loop over blocks: copy x into the array work(icopy) (copyx2f); loop over blocks: conditioning of the vector work(icopy) (pdiag in the diagonal case, pilu in the ILU case); loop over blocks: copy the updated work(icopy) back into x (copyf2x).]
Some columns of the matrix A are stored in the allocated arrays indicated by idprec in the diagonal case, and by ilprec and iuprec in the ILU case. The vector x is multiplied and divided by linear combinations of them.

In the case of SOR preconditioning the operations are carried out following figure 18. The structure is very similar, but in this case the preconditioner makes use also of the ghost cells to condition the internal values; so it is necessary to fill the ghost cell elements of the array work(icopy). It is important to note that only the internal cells are used for the time updating of the fields; this is why the matrix A, unlike the unknown x, includes the ghost cells in its columns.

[Figure 18: fspsolve structure in the case of SOR preconditioning. Loop over blocks: copy x into the array work(icopy) (copyx2f); B.C. setting (zerosbc); loop over blocks: conditioning of the vector work(icopy) (psor); loop over blocks: copy the updated work(icopy) back into x (copyf2x).]

The ghost cells of the vector work(icopy) are filled by setting boundary conditions as for the dependent variables, in the subroutine zerosbc; the operation was explained in paragraph 1.3 and figure 8.
As shown in figure 16, once the preconditioning is done, the algebraic linear system is solved. Matrix-vector products are necessary for this purpose. These operations are made in subroutine fsmatvec. The product is between the matrix A and the preconditioned vector x.

The subroutine fsmatvec follows a structure of the same kind as fspsolve (figure 19). The copy of the vector work(icopy) onto the vector x at the end of the subroutine is not done here, because the vector x is unchanged by this subroutine. Matrix dimensions must be compatible, so the array work(icopy) is used with the ghost cells filled in the subroutine zerosbc. The presence of the B.C. setting, as usual, implies the connection among blocks.

It must be noted that the area of permanent memory dedicated to the dependent variables (that is, the area localized by the pointer idepvp, figure 7) has been overwritten in solving the linear system. So it is essential to save the dependent variables before calling the fscg subroutine and to restore them after it, in each fs"variable" subroutine.

[Figure 19: fsmatvec structure. Loop over blocks: copy x into the array work(icopy) (copyx2f); B.C. setting (zerosbc); loop over blocks: compute A·x.]
The solver structure analysis leads to the conclusion that in the multi-block case a sort of domain decomposition is made, where the sub-domains are the blocks. It is important to note that this structure, together with the implicit time advancing method, causes differences between the multi-block and the one-block case.

When there is a block subdivision, a computational point is influenced by all the other points of the same block and by the two layers of ghost cells, which represent the first two layers of cells of the adjacent blocks (figure 8), and not by the whole domain as in a one-block configuration.
3 Analysis of parallelism

In the last section the block connections (B.C., global operations) and the block-loop structures have been underlined. Only the first must be changed for the parallelization, because the block loops are intrinsically parallel. Three alternative choices are possible to achieve the goal of the ARES parallelization, namely:

1. to parallelize only the first part of the code and not the solver;
2. as the previous point 1) but using public domain parallel libraries to solve the linear system;
3. to extend the parallelization also to the loops over the blocks inside the linear solver module.

Pros and cons of the above possibilities are briefly discussed in the following.
3.1 Approach #1

It is the simplest one, but it is also the least efficient. Substantially, the parallelization is limited to the building up of the linear system, which is then solved by a serial algorithm. Following this approach, the total number of communications needed among the processors at each time step is 2·nvar·ntxvol + nvar·ntcell·7. From previous experience on similar codes, and because of the limited number of communications among processors, it seems reasonable that the parallel part of the code would scale linearly with the number of processors. In order to get some idea of the possible gain in terms of speed-up of the computation, a profiling of the scalar version of ARES has been run. Results show that almost 50% of the run time is spent in the linear system solution, that is in the part of the code not affected by the parallelization procedure. In the ideal case of linear scalability of the parallel part and of an infinite number of processors, the maximum scalability factor would then be only 2; see the worked estimate below.
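This is the classical Amdahl's law bound: if a fraction p of the run time is parallelized on N processors, the expected speed-up is

      S(N) = 1 / ((1 − p) + p/N)

and with p ≈ 0.5 the limit for N → ∞ is S = 1/(1 − p) = 2; even with N = 8 the bound is only S = 1/(0.5 + 0.5/8) ≈ 1.78.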
3.2 Approach #2

According to this second alternative, the same steps performed at point #1 would be implemented, plus the use of public domain parallel linear solver libraries to be linked somehow to the existing code. In order to achieve such a parallelization, some steps must be performed. Public domain libraries with cgs and bi-cgstab parallel solvers have to be found. These libraries are typically built and optimized around a compact storage format, so some work would be needed for the transformation of the structure obtained in the subroutines fs"variable" (figure 12) into the structure requested by the public domain library.

Problems connected to this second choice are, in the first place, that such suitable libraries have to be identified and implemented into the code and, secondly, that at this level, with only an "a priori" analysis of the code, it is not possible to anticipate the code performances.
3.3 Approach #3

This approach consists in making parallel all the loops over the blocks which are used in the code. The same serial code structure would be substantially maintained, but the correct communications have to be ensured among the blocks. This means implementing a message passing structure that replicates the serial B.C. setting; such settings are present in subroutine fssolver, inside the SOR preconditioning (subroutine fspsolve) and in the computation of the matrix-vector products (subroutine fsmatvec). Moreover, all the operations which were described as "global" need some communications among processors. This approach should provide a gain in performance better than the first alternative and at least similar to the second one discussed. At each iteration of the linear solver the setting of boundary conditions is called once or twice (in the case of SOR preconditioning) and other global operations are needed. This has an expensive computational cost in a parallel calculation but, since fscg represents about 50% of the computation, it is expected not to be very relevant, as long as the blocks remain large enough (or the ratio surface/volume, which is a measure of communication versus computation, is small).
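To make the surface/volume argument concrete: for a cubic block of N³ cells, the data exchanged at each boundary condition setting grow with the face area, i.e. as N², while the work done on the block grows with its volume, i.e. as N³; the communication to computation ratio therefore decreases as 1/N, so doubling the block size per processor roughly halves the relative communication cost.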
4 Code Parallelization

In order to decide on the kind of approach for the code parallelization, some tests with different linear solving algorithms were made. The failure of these algorithms and the structure of the code suggested following approach #3. The code is parallelized in a block-to-processor philosophy with the help of the MPI message passing libraries. So the processors execute the work done by each block in parallel, but at each time step and at each linear iteration all the boundary conditions are to be set and all the global quantities are calculated by the use of message passing; a minimal sketch of this mapping is given below.
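A minimal sketch of the block-to-processor mapping (mype is the MPI rank defined in section 4.1; the loop body is schematic):

c     With one block per processor, the serial loop over blocks
c        do ibloc = 1, nbloc
c     degenerates into each rank working on its own block:
      ibloc = mype + 1
c     ... computation previously done inside the block loop ...
c     boundary conditions with neighbouring blocks are now exchanged
c     through MPI messages instead of local memory copies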
In this section all the important changes made in the code to achieve the parallelization will be described.
4.1 Initialization

In the main program the MPI initialization is made. It means, simply, writing the three Fortran lines:

      CALL MPI_INIT(error)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPE, error)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYPE, error)
In init the input data are read from files. Because of the block-to-processor approach every processor must perform an independent computation, and so all of them must read the same data. A number of input files equal to the number of processors is necessary to begin the calculation.

In this subroutine the routine readbc is called. In the parallel case, connections between processors are included. More information is requested than that needed by a block connection.
itype    description        bcvar(1)  bcvar(2)  bcvar(3)  bcvar(4)  bcvar(5)
1-200    Block Connection   icsi      ics1      ics2      inc1      inc2
400-600  Pe Connection      icsi      ics1      ics2      inc1      inc2

itype    description        bcvar(6)  bcvar(7)  bcvar(8)  bcvar(9)
1-200    Block Connection   -         -         -         -
400-600  Pe Connection      idsend    idrecv    N1        N2

Table 2: Boundary conditions input data.
The codification of boundary conditions for block connections and processor connections is summarized in table 2.

[Figure 20: Data exchange between two processors. Each of PE1 and PE2 sends its first two layers of internal cells and receives the corresponding data into its own ghost cells.]
Five integer indices are assigned to set the connected boundary conditions. The connected side (icsi) of the connected block is identified by the value of bcvar(1). The values of bcvar(2) to bcvar(5) identify the starting indices (ics1, ics2) and the increments (inc1, inc2) in the two directions of the side.

For a connection between two processors, the first five indices have the same meaning. In addition, four other indices are assigned. In fact, for the success of the sending and receiving operations, some tags are required which identify the messages. These tags are read from input and remain the same for the whole computation. bcvar(6) is the tag assigned to the message to be sent and bcvar(7) to the message to be received. bcvar(8) and bcvar(9) are the numbers of nodes in the two directions of the side.
In subroutine alloc all the pointers are defined. In this subroutine the number of cells of each processor is also calculated. In the parallel case it is useful to have the total number of cells in the whole domain defined as well. For this purpose a collective MPI routine is called which returns this information to all the processors. The MPI routine is called MPI_ALLREDUCE and its syntax is:

      call mpi_allreduce(input, output, count, datatype, operation, comm, error)

input       input data
output      output data
count       number of elements in the send buffer
datatype    datatype of each send buffer element
operation   kind of operation to compute
comm        communicator
error       error number

In this case the input data are the ncell values of each block, the output is their sum and the operation is mpi_sum; see the example below.
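As a concrete sketch of this call (the variable names ncell_loc and ncell_tot are illustrative, not the actual Ares ones):

c     Total number of cells over all processors (illustrative sketch)
      include 'mpif.h'
      integer ncell_loc, ncell_tot, ierr
c     ncell_loc holds the number of cells of the local block
      call mpi_allreduce(ncell_loc, ncell_tot, 1, MPI_INTEGER,
     &                   MPI_SUM, MPI_COMM_WORLD, ierr)
c     every processor now holds ncell_tot, the global cell count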
After the mesh is read, the layer of dummy coordinates is set up. This is done in subroutine dumcoo.f. In the parallel case a data exchange between neighbouring processors is needed. It is made as in a normal block connection, but the data exchange is done by using message passing, as in figure 20.

The operation is divided in two parts.

First, each processor prepares in a buffer the array with the information to be sent, as shown in figure 21. The subroutine that is called for sending the message is MPI_IBSEND and its syntax is:

      call mpi_ibsend(buf, count, datatype, dest, tag, comm, request, error)
Second, the processors receive the messages and store the new information. The steps are explained in figure 22. The subroutine called for the receiving operation is MPI_RECV, whose syntax is:

      call mpi_recv(buf, count, datatype, source, tag, comm, status, error)

The meaning of the arguments is described in table 3.

buf        initial address of the buffer
count      number of elements in the buffer
datatype   datatype of each buffer element
dest       rank of the destination (mpi_ibsend)
source     rank of the sender (mpi_recv)
tag        message tag
comm       communicator
request    communication request handle (mpi_ibsend)
status     receive status object (mpi_recv)
error      error number

Table 3: Arguments of the sending and receiving operations
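A minimal sketch of one such exchange between two neighbouring processors (the buffer size, the tags and the rank other are illustrative; the tags would come from bcvar(6) and bcvar(7); note that a buffered send requires a user buffer attached with MPI_BUFFER_ATTACH, a detail not shown in the report):

c     Illustrative ghost-cell exchange between two ranks (not Ares code)
      include 'mpif.h'
      integer nbuf
      parameter (nbuf = 1000)
      integer ierr, req, stat(MPI_STATUS_SIZE)
      integer other, tagsnd, tagrcv
      double precision sendb(nbuf), recvb(nbuf)
      double precision attb(nbuf + MPI_BSEND_OVERHEAD)

c     attach a buffer once, as required by buffered sends
      call mpi_buffer_attach(attb, 8*nbuf + MPI_BSEND_OVERHEAD, ierr)

c     ... load sendb with the two layers of internal cells ...
      call mpi_ibsend(sendb, nbuf, MPI_DOUBLE_PRECISION, other,
     &                tagsnd, MPI_COMM_WORLD, req, ierr)
      call mpi_recv(recvb, nbuf, MPI_DOUBLE_PRECISION, other,
     &              tagrcv, MPI_COMM_WORLD, stat, ierr)
      call mpi_wait(req, stat, ierr)
c     ... store recvb into the ghost cells ...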
4.2 Solver

Boundary conditions for connections among processors are treated in the same manner as for the dummy coordinates. This operation is repeated for all the dependent variables.

As shown in figure 16, inside the linear solver the boundary conditions are set and, furthermore, some global operations are made.

[Figure 21: Sending operation. Each processor loads the data for its neighbours into send buffers.]

[Figure 22: Receiving operation. Each processor receives the messages and stores them into its ghost cells.]

The global operations, described in section 2.5, are Euclidean norms performed over the whole set of blocks. In a parallel calculation the order of these operations is not conserved numerically: because of the truncation error, in some cases a change in the order can produce big differences in the result.
To reach the same results the procedure is divided in two parts, see figure 23. First, each processor makes its own serial computation. Then all the data are put in a single array on the first processor, exactly as it would be in a serial calculation with n blocks. This operation is possible in parallel by calling an MPI collective subroutine called MPI_GATHERV. Processor #1 performs the norm, having all the data needed and in the same order as in the serial case. Of course the norm performed in this way is safe for the result, but it is very expensive in terms of computational time. In fact the norm is made in a serial way, as with a single processor, and moreover a collective MPI operation is called. This is an intrinsic limitation of this kind of algorithm.

[Figure 23: Norm in parallel. The data of pe1, pe2, ..., peN are gathered into a single array on pe1, which then computes the norm in serial order.]

For this reason a second, cheaper procedure
was implemented in order to improve performances. The norms are calculated separately by each processor and the global norm is obtained by summing the results with the help of the MPI routine MPI_REDUCE; this does not maintain the serial order. The routine MPI_REDUCE has the same syntax as the routine MPI_ALLREDUCE, plus the rank of the root processor; the only difference is that MPI_REDUCE returns the result to one processor only. From a mathematical point of view the two methods are equivalent, but not numerically. In all the test cases performed the differences appear unimportant and the results are correct, with a powerful gain in performance; a sketch of this second procedure is given below.
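A minimal sketch of this second procedure (the array name and size are illustrative; the MPI_ALLREDUCE variant is shown so that every rank gets the norm, while the report's implementation uses MPI_REDUCE to return it to one processor only):

c     Parallel Euclidean norm via partial sums (illustrative sketch)
      include 'mpif.h'
      integer i, n, ierr
      parameter (n = 1000)
      double precision rhs(n), psum, gsum, gnorm
c     rhs holds the local residuals of this rank
      psum = 0.d0
      do i = 1, n
         psum = psum + rhs(i)*rhs(i)
      end do
c     sum the partial results of all ranks; the summation order differs
c     from the serial one, so the result may differ at round-off level
      call mpi_allreduce(psum, gsum, 1, MPI_DOUBLE_PRECISION,
     &                   MPI_SUM, MPI_COMM_WORLD, ierr)
      gnorm = sqrt(gsum)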
4.3 Convergence

Finally, the history of convergence is written at each time step. The maximum residual, the rms value and the position of the maximum residual are calculated for each dependent variable. These are global quantities, so MPI collective operations are required for computing their correct values. This is done in subroutine globcon, called in fssolver. The maxima are calculated locally and then the maximum among all the processors is chosen, together with its position (including the processor). The rms value is computed by summing the values of each processor. The MPI routine called to do these operations is MPI_REDUCE. The operation computed inside the MPI routine is always mpi_sum.
5 Validation

At this stage of development, the test cases performed had to assure the correct working of the code. In particular, two boundary layer cases (a laminar and a turbulent flat plate) and a Moreau combustion chamber were computed. All the cases were computed with both the algorithms available to solve the linear system, cgs and bi-cgstab. In all the cases the results are exactly equal to those obtained with the serial simulation. Figures 24, 25 and 26 report the convergence history in the three cases: it is not possible to distinguish between the serial and the parallel simulations.

The speed-up of the code was studied in one of the test cases. The Moreau combustion chamber was analyzed varying the number of processors from 1 to 8. Table 4 shows the results of the serial code against those of the parallel one.
proc.          CPU time   Speed-up
1 (2 blocks)   438.26 s   1.0
2 (2 blocks)   230.17 s   1.91
1 (4 blocks)   482.29 s   1.0
4 (4 blocks)   189.01 s   2.56
1 (8 blocks)   535.97 s   1.0
8 (8 blocks)   135.24 s   3.96

Table 4: Performances
The first column shows the number of processors used. The CPU time grows with the number of blocks due to the increased communication work inside the linear system algorithm; this is the reason why a number of 1-processor runs have been made with different numbers of blocks. Then CPU times and speed-up factors are shown. From the results it is possible to see that the performance of the code, as expected, presents some scalability, but the speed-up is far from the ideal linear one. The performance of the parallel implementation will be assessed in more detail in the coming report on the overall validation of the code.
[Figure 24: Convergence in the laminar flat plate. u-residual (log10) vs. iteration; the single pe and parallel curves coincide.]

[Figure 25: Convergence in the turbulent flat plate. u-residual (log10) vs. iteration; the single pe and parallel curves coincide.]

[Figure 26: Convergence in the Moreau combustion chamber. u-residual (log10) vs. iteration; the single pe and parallel curves coincide.]
A Input and Output files

Much information about the kind of simulation the code has to compute is written in input files. The input and output files follow the form

      prefix.ext

where prefix is the name of the test case. In the subroutine init these files are opened and read, the prefix being taken from the file PREFIX; ext corresponds to the following extensions:

      dat, msh, ini, con, out, tec, log.

In the case of a reacting calculation, the files prefix.pre and prefix.dif can also be read. One or more files prefix.jetn should be present if external velocity or turbulence profiles are used.
A.1 The dat file

Input data are read from the dat file. Here is all the information needed to choose the kind of simulation (choice of algorithms, turbulence and combustion models, geometrical information etc.). Once the dat file has been read, if requested by the chosen combustion model, other input files are read with the right values of the parameters used in the model.
A.2 The bc file

The file with suffix bc contains all the information needed for a correct treatment of the boundary conditions. The dimensions of the grid blocks are read. The indices for all sides and segments are assigned. The kind of conditions and the number of cells for all the boundaries are read and stored in integer arrays. At this level the profiles of velocity and of the turbulence variables are read, if they are assigned.

Many kinds of boundary conditions are allowed. There are definitions for different adiabatic walls, inlet and outlet profiles. There are all the parameters needed in block connections. The multi-block structure requires an exchange of data between neighbouring blocks to assign the correct value of the residuals on the boundary. The data exchange is done in only one step. The memory locations corresponding to the two layers of ghost cells are filled with the values of the corresponding inner-field dependent variables. These represent all the information needed to compute the fluxes through connected boundaries.

Using the information contained in dat and bc, the work data structure is set up.
A.3 The msh file

The mesh is read from the binary file msh. To treat the boundary correctly, two layers of dummy coordinates are constructed around the boundaries of each block using the values of the coordinates and normals of the boundary cells.

This is done in the subroutine dumcoo. All volumes and normals, also for the dummy cells, are set.
A.4 The con file

Because the simulations to be performed are almost always stationary, the calculation is expected to reach convergence after a number of iterations. It is important to monitor this behaviour in order to decide when and whether the computation has succeeded. The most common choice is to follow the advancing of the right hand side of the momentum equations changed in sign, which, in a stationary calculation, goes to zero iteration after iteration. These right hand sides are the fluxes of the dependent variables, which are also called residuals. In the Ares code the residuals of all the variables used in the calculation are stored. In the first 3 iterations their absolute value is stored and written in the file .con. The third is taken as the reference residual res_0 for the following iterations. At iteration i the residuals are related to the reference value as

      log10( res_i / res_0 )

and this expression is evaluated for the sake of simplicity. The calculation is considered to have reached engineering convergence when the residual has dropped below 10^−5 of its reference value, that is when this expression is less than −5.

With the values of the normalized residuals, their location in the grid is also written: for each variable, the indices i, j, k of the cell and the block where the maximum absolute residual is found are written.
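For example, if the reference residual is res_0 = 2.0e−1 and at iteration i the residual has fallen to res_i = 1.5e−6, then log10(res_i/res_0) ≈ −5.1 and the run is considered converged.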
A.5 Restart simulations

When the simulation is a restart of an older one, the given solution is read from the ini file, which contains the solution of a previous calculation. The code must also know the convergence history of the previous calculation, hence the con file of the previous run must be present. The con file contains all the convergence information: the last iteration, the time step and the values of the maximum residuals.
A.6 The Output Files

The output files are opened at the first iteration. The values, in every cell of the computational domain, of the field variables at the end of a run session are stored in the out file.

The tec file is an unformatted file intended for successive processing of the solution data (visualization, further elaboration, etc.). In the tec file the mesh coordinates and the values of the field variables are stored.

The log file is an ascii file reporting a summary of the run characteristics.
References

[1] Mulas, M., Beeri, Z., Golby, D., Surridge, M., and Talice, M. "A parallel Navier-Stokes code for large industrial applications". In P. Kutter et al. (eds), Lecture Notes in Physics, Springer, 1997, pp. 450-455.

[2] Mulas, M., Chibbaro, S., Delussu, G., Di Piazza, I., and Talice, M. "The CFD code Karalis". CRS4-TECH-REP 00/87, 2000.

[3] Metcalf, M., and Reid, J. FORTRAN 90/95 Explained. Oxford University Press, 1996.

[4] XL FORTRAN for AIX, Language Reference. IBM, 1994.

[5] XL FORTRAN for AIX, User's Guide. IBM, 1994.

[6] Appunti scuola di parallelismo del CINECA (notes from the CINECA school on parallel computing). CINECA, 1998.

[7] MPI: A Message-Passing Interface Standard. 1995.