Centre for Advanced Studies,
Research and Development in Sardinia
Uta - (CA)

ENEA-MURST Programme
Objective 7

Implementation of the Ares code on parallel architectures

by:
S. Chibbaro and M. Mulas
Computational Fluid Dynamics Area
This report describes the MPI parallelization of the combustion code Ares. First, the structure of the code is described; then a detailed analysis of the algorithms used to integrate the flow equations is carried out, with particular emphasis on the parallelization issues. An analysis of a few possible alternative approaches is then followed by the description of the chosen approach. A preliminary validation of the parallel code completes the report. Thorough validation on complex 3D test cases will be presented in the next report.
Contents

1 Structure of the code
  1.1 The Multi-Block Structure
  1.2 Data allocation
  1.3 Boundary conditions
  1.4 Time Integration
2 Code description
  2.1 Initialization
  2.2 The Solver
  2.3 Integration Module
  2.4 fsvel subroutine
  2.5 fscg subroutine
      2.5.1 CGS solver
3 Analysis of parallelism
  3.1 Approach #1
  3.2 Approach #2
  3.3 Approach #3
4 Code Parallelization
  4.1 Initialization
  4.2 Solver
  4.3 Convergence
5 Validation
A Input and Output files
  A.1 The dat file
  A.2 The bc file
  A.3 The msh file
  A.4 The con file
  A.5 Restart simulations
  A.6 The Output Files
References
1 Structure of the code

The Ares code belongs to a family of codes developed in past years by the Fluid Dynamics group of CRS4 (see references [1] and [2] for instance). It can be divided into three main parts.

INIT: all variables are initialized and data are allocated.

FSSOLVER: residuals are computed and all equations are advanced in time with an implicit fractional step method.

OUTPUT: all the output fields are written.

This report begins with the description of the global structure of the code. Then the multi-block structure, the boundary condition codification method and the data allocation are illustrated.
1.1 The Multi-Block Structure

In cell-centered Finite-Volume codes the boundary condition treatment is done by means of a layer of ghost cells surrounding the computational domain. In case of multi-block meshes, time-advancing iterations should be performed on each block independently and, eventually, in parallel, as shown in figure 1:
      do ibloc = 1, nbloc
c        ... computation on block ibloc ...
      end do

Figure 1: The block loop structure
Two layers of ghost cells are then used to store both the geometrical definition (volumes and surface normals) and field information (values of the dependent variables).

Figure 2 shows how the domain splitting process consists of constructing the appropriate ghost cells for each sub-block, which in turn means duplicating the information on both sides of the interface cut.

[Figure 2: Domain splitting. The interior domain is split and each sub-block is surrounded by 2 layers of ghost cells.]
If NI, NJ and NK represent the number of grid nodes of a structured mesh, then all of the field arrays within a given block are dimensioned in the following way:

      ARRAY(-1:NI+1, -1:NJ+1, -1:NK+1)

no matter whether they represent cell-center values (such as volumes, dependent variables, time-steps, ...), node values (such as grid point coordinates), or values of physical and geometrical entities located at the cells' interfaces (such as, for instance, surface normals and fluxes). Figure 3 shows, for the sake of simplicity, a 2D grid made of NI × NJ nodes. Internal cells range from index (1,1) (lower-left corner) to index (NI−1, NJ−1) (upper-right corner), while node-centered entity arrays range from index (1,1) (lower-left corner) to index (NI, NJ) (upper-right corner).
[Figure 3: Grid nomenclature. Grid points range from (0,0) to (ni,nj) including the dummy layer; cell centers range from (−1,−1) to (ni+1,nj+1), the internal ones from (1,1) to (ni−1,nj−1).]
Figure 4 shows how the different block faces are identified for the application of boundary conditions. For generality and flexibility purposes, each face can be subdivided in various portions (segments), as shown in fig. 5, each segment corresponding to a different boundary treatment.

[Figure 4: Face nomenclature. Faces 1 to 6 of a block with respect to the i, j, k directions.]

[Figure 5: Face splitting]
1.2 Data allocation

A one-dimensional double precision array work is used to store all data: as shown in figure 6 the first part of work is reserved for permanent data, whereas a second part is used to dynamically store temporary data.
[Figure 6: The work array structure. Locations 1 to NLAST hold the permanent storage of blocks 1 to NBLOC; locations NLAST+1 to NMAX are managed as a stack for temporary storage.]
The work array is dimensioned in the main program as:

      DOUBLE PRECISION WORK(NMAX)

where NMAX is a parameter which can be modified by the user depending on the size of the test case. Figure 6 shows the array structure: memory is allocated for all permanent data of all blocks (the total number is NBLOC). NLAST represents the last array location allocated for permanent data. The space left for temporary storage is then made of (NMAX − NLAST) locations.

The temporary storage is managed as a stack with two utility routines which act as the C functions push and pop.
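As a minimal sketch of how such a stack allocator on the work array can work (the routine and variable names ipush, ipop and /stck/ are hypothetical, not the actual Ares ones; itop must be initialized to NLAST):

      subroutine ipush(n, iptr)
c     reserve n locations on top of the stack and return their start
      integer n, iptr
      integer itop, nlast, nmax
      common /stck/ itop, nlast, nmax
      if (itop + n .gt. nmax) stop 'work array exhausted'
      iptr = itop + 1
      itop = itop + n
      end

      subroutine ipop(n)
c     release the n locations that were allocated last
      integer n
      integer itop, nlast, nmax
      common /stck/ itop, nlast, nmax
      itop = itop - n
      end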
Each block area, shown in figure 7, is organized in the same way. The data can be accessed by means of arrays of pointers, all of them dimensioned as (NBLOC_MAX),

[Figure 7: Details of the work array structure. Each block area stores in sequence: the dependent variables (p, u, v, w, f1...fn, k, ε), pointed to by idepvp(ibloc); the old dependent variables, idepop(ibloc); the distance from the closest surface (Spalart-Allmaras), idisp(ibloc); the density (ρ), irrr(ibloc); the turbulent viscosity, imup(ibloc); the diagonal pressure correction, idiap(ibloc), and the old one, iodiap(ibloc); the volumes, ivolp(ibloc); the coordinates (x, y, z), icoop(ibloc); the face normals (NIx, NIy, NIz, NJx, NJy, NJz, NKx, NKy, NKz), inorp(ibloc); the cell reference lengths, illp(ibloc); the time steps, idtp(ibloc); plus ncjet, ivcomp and the injet boundary field for jet cells.]
where NBLOC_MAX is a user-defined parameter which identifies the maximum number of blocks allowed. All pointer arrays are stored in the COMMON block, together with the following ones:

      NNI(NBLOC_MAX)
      NNJ(NBLOC_MAX)
      NNK(NBLOC_MAX)
      NCELL(NBLOC_MAX)

which store the dimensions of the mesh (NI, NJ and NK) and the total number of cells (internal plus two layers of ghost cells) of the given block, NCELL = (NI+3)·(NJ+3)·(NK+3).
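For instance, a block of 51 × 51 × 51 grid nodes contains 50³ = 125000 internal cells, while its arrays span NCELL = 54 · 54 · 54 = 157464 locations once the two ghost layers per side are included.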
Table 1 summarizes all the defined arrays of pointers: for each one a very brief description is given together with its size, in the general case of turbulent combustion with four scalars; the size is to be intended in units of NCELL.

The storage of old dependent variables is needed for the time advancing solution method.

Name     description                 size
IDEPVP   dependent variables         10
IDEPOP   old dependent variables     10
IDISP    distance from solid walls   1
IRRR     density                     1
IMUP     turbulent viscosity         1
IDIAP    pressure correction         1
IODIAP   old pressure correction     1
IVOLP    volumes                     1
ICOOP    coordinates                 3
INORP    normals                     9
ILLP     cell reference length       3
IDTP     time step                   1

Table 1: List of arrays of pointers.
Besides the pointers to permanent memory locations, six further pointers are stored in the COMMON block. They are used as temporary, and so popped and pushed whenever needed. Generally such arrays are not statically stored in COMMON but declared in the subroutines where they are used. These six arrays are stored in COMMON because they are used in the solver, at each time step, in many subroutines, so it is useful to have them statically stored; the effect is that, inside the solver only, they are treated as permanent pointers. They are imat, irhs, ixcg, which are used in solving a linear system, and ilprec, idprec, iuprec, which are used to precondition the matrices of the linear system, whenever a preconditioner is used.
1.3 Boundary conditions

In the ARES code a number of choices are available to impose physical boundary conditions.

In case of multi-block meshes where different blocks are connected, it is necessary to exchange data between them. Two layers of ghost cells surrounding the block computational domain are then used to store field quantities.

The data exchange is schematically shown in figure 8. After the data exchange each block contains both its internal and the external data, in such a way that the fluxes are computed as if the block subdivision didn't exist.

[Figure 8: Data exchange. The two layers of ghost cells of each block are filled with the first two layers of internal cells of the adjacent block.]
The meshes used are structured and conformal. This implies that all the points are defined by three indices I, J, K. In subroutine readbc the parameters used for setting the boundary conditions are read. First, 5 indices are read for all the sides of each block:

      itype, ist1, ien1, ist2, ien2.

They are 5 integers which represent the type of BC to be applied (inflow, outflow, connection) and the values of the first and last nodes of the two indices which describe the side considered. For example, for a side defined by the equation I = 1, the indices J and K can change, and the parameters could be, for instance, ist1 = 1, ien1 = nj, ist2 = 1, ien2 = nk. An important feature of the ARES code is the generality of the orientation of each block with respect to the others; the only limitation is that only structured and conformal grids are permitted.

In case the itype parameter identifies a block connection, a second line of parameters is also read. These are:

      icsi, ics1, inc1, ics2, inc2.

They are 5 integers which indicate the side connected, the values of the indices at its first node and their increments (+1 or −1).

All these parameters are stored in integer pointers in the COMMON block.
In subroutine dumcoo a layer of ghost nodes is built up, so that two layers of ghost cells are formed. In block connections the value of the mesh at each node is filled with an exchange of data as shown in figure 8. After the ghost cell construction, all the indices of the boundary conditions are stored with new values: they are now written for cell-centered boundaries and include the ghost cells.

In the connections among blocks it is essential to load and unload the arrays to be exchanged in the right order, because of the generality of the orientations of each block.

This is possible knowing the integer parameters read in readbc. These parameters describe how the side iside of the block ibloc is connected with the side icsi of the block icbl. A simple algorithm also indicates the orientation of the connected block and permits loading the array to be exchanged in the right order; it is:
      inc3 = 1
      iss  = icsi + iside
      iss2 = (iss/2)*2
      if (iss2 .eq. iss) inc3 = -1
      inc  = max(inc1*inc2*inc3, 0)                    (1)
This procedure fixes the value of inc to be 0 or 1. If it is 0, the orientation of the blocks is such that the order of the indices must be inverted with respect to the natural order.
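As a worked example (the index values are purely illustrative): for iside = 1 and icsi = 3, iss = 4 is even, so inc3 = −1; with increments inc1 = inc2 = 1 the product inc1·inc2·inc3 = −1 and inc = max(−1, 0) = 0, so the exchanged array must be loaded with the index order inverted.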
1.4 Time Integration

Equations for velocity, pressure and eventually scalars and turbulent fields are advanced in time by using an implicit integration method. The general form of the discretized system of partial differential equations to be solved can be written as:

      ΔQ/Δt = Res^{n+1}                                (2)

By using an implicit scheme, the right hand side of system 2 at the temporal step n+1 can be obtained by its linearization:

      Res^{n+1} = Res^n + (∂Res/∂Q)^n ΔQ               (3)
so that, in order to advance the field from time n to the next time n+1, equation 2 can be rewritten as:

      ΔQ = Res^n Δt + (∂Res/∂Q)^n ΔQ Δt                (4)

By defining the Jacobian matrix J^n as J^n = (∂Res/∂Q)^n, equation 4 becomes:

      (I/Δt − J^n) ΔQ = Res^n                          (5)
which has the general matrix form:

      M x = rhs                                        (6)

where M is the matrix of the coefficients and x the unknown vector.

The right hand side of system 6 contains the cell residuals of the variables to be solved. The form of these fluxes depends on the spatial discretization scheme. With an implicit time advancing method it is necessary to solve the linear system 6 at each time step. In the ARES code two algorithms are available: CGS and BI-CGSTAB. They are two standard algorithms implemented in the code via public domain library templates. They solve the linear system 6 through an iterative procedure. They have the same structure, and in the following analysis both of them will be called simply CGS solver for the sake of simplicity. If requested, a preconditioner is joined to the CGS solver.

The equations of motion, the compressible Navier-Stokes equations, are solved with a fractional step method. In this method, first a velocity field is computed (one that does not necessarily respect the steady continuity equation); then the correct pressure field is computed; finally, using the value of the pressure field, the velocity is corrected to enforce the continuity constraint.
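A schematic sketch of one time step of this fractional step scheme, using the subroutine names that will be introduced in section 2.2 (the calls are shown without their argument lists):

      call fsvel
c        provisional velocity field, not necessarily divergence-free
      call fspressure
c        pressure field and velocity correction (projection)
      call fskeps
c        turbulence equations (if the calculation is turbulent)
      call fsscal
c        scalar equations (if the calculation is reacting)
      call fsdensity
c        density from the equation of state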
2 Code description

2.1 Initialization

In the main program, init is called.

[Figure 9: The init structure. Read the input files; call readbc and store the B.C. information; call alloc and set up the data structure; read the mesh and set up the geometry; calculate the distance from the wall (SA); open the output files.]

In this subroutine all the initialization operations are made, following the scheme of figure 9.
All the input files are read.

The subroutine readbc is called; here all the information about boundary conditions is read and stored.

The subroutine alloc is called, where the data structure described in paragraph 1.2 is set up.

The layer of ghost nodes is built up. All the geometrical cell variables (volumes and normals) are calculated. In the case of a turbulent simulation with the Spalart-Allmaras model, the distance of each cell from the nearest wall is computed.

Finally, all the output files are opened.
2.2 The Solver

Once the initialization is finished, the subroutine fssolver is called from the main program. The solver is the core of the whole program and represents the time integration module.

[Figure 10: The solver structure. Time iteration loop: set BC + TURB (setbc, fsmutur, turbsbc); store old values; compute the time step (fststep); advance the variables in time (fsvel, fspressure, fskeps, fsscal, fsdensity, fsmutur); set BC + TURB again; convergence check.]
With reference to figure 10, at each iteration step the following is performed:

1. The subroutines setbc and sabc or turbsbc are called, respectively if the Spalart-Allmaras or the k-ε turbulence model is used; so boundary conditions are applied for all the dependent variables. In turbulent flows the subroutine fsmutur is called to compute the turbulent viscosity.

2. The values of the dependent variables and of the pressure correction at the previous time step are stored; the dependent variables, as shown in paragraph 1.2, are P, that is the pressure; U, V, W, which are the three velocity components; f_i, which are the scalar fields; and the turbulence variables.

3. The time step is computed. It is calculated on a local cell basis, once the global CFL number is provided as an input (cfl = a·Δt/Δx ≤ 1); see the sketch after this list.

4. All the dependent variable fields are advanced in time. The subroutine fsvel is called, which advances the velocity. The subroutine fspressure is called to advance the pressure and correct the velocity. The subroutine fskeps is called to advance the turbulent field following the chosen turbulence model. The subroutine fsscal is called to compute the scalar fields. The subroutine fsdensity is also called: the density is calculated through the equation of state. The subroutine fsmutur is called, if the calculation is turbulent, to calculate the new value of the turbulent viscosity field.

5. Boundary conditions are applied to all dependent fields again.

6. A check on the convergence level is made in order to test whether a prefixed value has been reached.
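As a sketch of the local time step computation of point 3 (the array and variable names are illustrative, not the actual Ares ones; a is the local wave speed and dx the cell reference length):

c     Local time step from the global CFL number (illustrative sketch):
c     dt is limited cell by cell so that cfl = a*dt/dx <= 1
      do k = 1, nk-1
         do j = 1, nj-1
            do i = 1, ni-1
               dt(i,j,k) = cfl * dx(i,j,k) / a(i,j,k)
            end do
         end do
      end do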
Block loops are performed inside each module of routine fssolver. From the point of view of ease of parallelization, the fact of having a loop over the blocks inside each one of the previously enumerated modules is quite appealing, because it fits the natural block-to-processor approach.

All the modules but the fourth one perform their computation independently on each block and are intrinsically parallel. Communications among the blocks occur where boundary conditions are imposed and in module 4. A more detailed description of what is performed at point 4 is presented in the next paragraph for clarification.
2.3 Integration Module

In the above paragraph it has been shown (figure 10) that, in order to advance the variable fields in time, five subroutines named fs"variable" are called from the subroutine fssolver. It has been noted that the time marching algorithm is an implicit one and that it involves the construction and the solution of an algebraic linear system. This is done in these subroutines.

These subroutines all have the same structure, which is shown in figure 11. So only the first called, that is fsvel, will be analyzed, it being implicit that the same analysis holds for all the others.

[Figure 11: Structure of the generic fs"variable" subroutine. Allocate the system matrix; build up the linear system (loop over blocks); solve the linear system; advance the fields at the next time step (loop over blocks); deallocate the system matrix. The second module is described in fig. 13.]
Some comments must be made on figure 11. To advance the variable fields in time, an algebraic linear system of the form 6 has to be solved. At the beginning of all of these subroutines fs"variable" the matrix M, here called A, the unknown x and the right hand side rhs are allocated in work as temporary arrays: they are pushed. As noted in paragraph 1.2, the pointers which indicate the memory locations of these arrays are imat for A, ixcg for x and irhs for rhs. They are deallocated only at the end of the subroutine (figure 11), so they are treated as permanent arrays during all the calculation inside each subroutine fs"variable". The difference with respect to a true permanent array, for example the one indicated by the pointer idepvp, is that each variable needs a different pointer for a permanent array, while the same pointers imat, ixcg, irhs are reused in the solution of the linear system for each variable, that is in all fs"variable" subroutines. At this level the size of these arrays is known. The dimensions of the matrices A and x are:

      A = 7 · (NI+3)(NJ+3)(NK+3) · nvar
      x = (NI−1)(NJ−1)(NK−1) · nvar                    (7)

and rhs has the same dimensions as x. Here (NI−1), (NJ−1), (NK−1) give the total number of internal cells in the whole computational domain, all blocks included; nvar is the number of variables calculated and updated in the fs"variable" subroutine: for example, nvar is 3 in fsvel for the velocity and 1 in fspressure for the pressure. The matrix corresponding to the pointer imat is not called M but A because of its dimensions. In the linear system 6 the matrix M must have a number of columns equal to the dimension of the unknown x. Instead the memory allocation localized by the pointer imat for the matrix A is larger, for both internal and ghost cells are placed in the columns.
[Figure 12: Matrices of the linear system. The matrix M (work(imat)), the unknown X (work(ixcg)) and the RHS (work(irhs)) are stored as sequences of per-block sections B1, B2, ..., BN.]
Special attention has to be paid to how the matrix A, the unknown vector x and the rhs of system 6 are stored. The criterion used to store these arrays is explained in figure 12.

The matrices are evaluated inside a loop over blocks. At the end of the loop all the matrices, even if computed independently, are stored all together: all the data concerning the linear system are stored in work. The fact that these arrays are constructed inside a loop over blocks is reflected by their form; as shown in figure 12, they are "block-matrices".

At this level there is no communication among the blocks. At this point the overall system 6 has to be solved, where now A is a block matrix.
In figure 11 it has been underlined which parts of the subroutines fs"variable" are inside a loop over blocks. They concern the construction of the linear system and the updating of the variable fields, which takes place after the solution of the linear system. This is important because the calculation inside a loop over blocks is distributed among the blocks and is computed independently by each one, in total agreement with the one-block-one-processor approach. The module concerning the solution of the linear system is outside the loop. This means that not all the computation is performed independently by each block.

In the next paragraph the different modules of figure 11 will be "opened" for the analysis of the fsvel subroutine, as an example for all the fs"variable" subroutines, in order to better understand the ARES structure and particularly its parallelization properties.
2.4 fsvel subroutine

With reference to figure 11, the first module contains the array allocation.

In the second module the linear system is built up. The flow chart of this module is shown in figure 13. It has been put in evidence in the figure that the whole module is inside a loop over the blocks.

[Figure 13: Implicit solver structure, first loop over the blocks. Compute the inviscid part of the Jacobian matrix (fsjacinv); add the Jacobian for the viscous fluxes (fsjaclap, fsjaclaprng); add the unsteady term; save the pressure-velocity gradient correlation for the projection method (dashed block, momentum equation only); compute the inviscid and viscous fluxes (fsflxinv, fsflxvis); build up the system RHS and compute the residual norm.]

With reference to equation 6, it is necessary to calculate the matrix M and the right hand side. First the matrix M is calculated. This is done in three parts, as shown in figure 13. The Jacobian matrix is calculated sequentially for the inviscid terms (routine fsjacinv), the viscous terms (routine fsjaclap) and the unsteady term. These terms are stored in the array A (the pointer imat).

It is remarkable that this structure is shared by all the fs"variable" subroutines. The dashed logic block in figure 13 is executed for the momentum equation only (fsvel). Here the pressure-velocity correlation is saved for future use by the pressure correction scheme.

Once the construction of matrix A is finished, the right hand side rhs is calculated. It contains the cell residuals of the variables to be solved. As in the case of matrix A, first the inviscid terms are computed (subroutine fsflxinv), then the viscous ones (subroutine fsflxvis); the computation of the fluxes makes use of the ghost cells. The operations are made with the help of a temporary array work(iflx). This array is pushed with the right dimensions:

      (NI+3)(NJ+3)(NK+3) · nvar

When all the operations are done, the internal cells of work(iflx) are copied into the array rhs and the temporary array work(iflx) is popped. This is done in the last module of figure 13.
In this module the maximum residual value is finally calculated for monitoring the convergence. At this point the linear system is built up and the loop over blocks ends.

The linear system is solved in the second part of the implicit solver, the third module of figure 11. The flow chart is represented in figure 14.

It has to be noted from figure 14 that the old values of the variables to be computed are first saved into a temporary array (subroutine fsf2t). This is done because their permanent allocation in the overall working array work is overwritten inside the linear solver (see section 2.5.1); they are restored afterwards (subroutine fst2f).

[Figure 14: Implicit solver structure, call to the linear solver. Save the generic field into a temporary array (fsf2t); call the linear solver (fscg); restore the temporary array into the permanent one (fst2f).]
2.5 fscg subroutine

The structure of subroutine fscg is reported in figure 15. In the ARES code a conjugate gradient squared algorithm is used for solving the algebraic linear system. This method may require a preconditioning of system 6. The preconditioning is carried out by multiplying the vector x by a compatible matrix, composed of a choice of A coefficients. The matrices necessary for the preconditioning are allocated as temporary arrays in work. There are four choices:

1. no preconditioning;
2. diagonal preconditioning; the pointer idprec is allocated;
3. ILU preconditioning; the pointers ilprec and iuprec are allocated;
4. SOR preconditioning; the pointer idprec is allocated.

In the memory area localized by these pointers the coefficients needed for the preconditioning are calculated and stored. The subroutine fspdiag is called for a diagonal preconditioning, the subroutine fspilud for ILU preconditioning, and the subroutine fspdiag again for SOR preconditioning.

Then the CGS solver is called. There are two available subroutines: cgs and bi-cgstab. The system is preconditioned and solved in the unknown x with an iterative procedure.

[Figure 15: fscg structure. Allocation of the preconditioning matrices (fspdiag for diagonal or SOR, fspilud for ILU); selection of the CGS solver (cgs or bi-cgstab); convergence check; deallocation of the preconditioning matrices.]
A check on the convergence of the solver iterative procedure is made at the end; if the requested tolerance has not been reached, the array x is set to zero.

In the last module of subroutine fscg the temporary arrays allocated for the preconditioning are deallocated.

It has been noted that the two linear system solvers have the same structure, so a generic CGS solver is described. The observations are the same in both cases.
2.5.1 CGS solver

A schematic flow chart of the module "CGS solver" is given in figure 16.

[Figure 16: CGS solver structure. Copy the RHS array into a working array (TEMP(i) = RHS(i), i = 1, ..., NTOT, the number of RHS elements); compute a global norm (Σ_{i=1}^{NTOT} RHS(i)²)^{1/2} to check for convergence; compute the global constants needed by the algorithm; then, at each CGS iteration and for each block (do ibloc = 1, NBLOC): set BC's and apply the preconditioning (fspsolve), perform a matrix times vector product (fsmatvec) to invert the system, update and correct the solution vector x, check convergence and return when satisfied.]

With reference to figure 16, first the right hand side array content is copied into a temporary array, and an overall norm is computed. Both these operations are made over all the (1:ni−1, 1:nj−1, 1:nk−1, nvar, nbloc) elements of the RHS vector. It is important to note that while the first step (copying) can easily be parallelized without any need of communication among processors, the second one requires message passing instructions.

In the third module of figure 16, all the scalar parameters needed by the algorithm are computed. The basic vector operations of this kind of library are totally local, that is of the form:

      A(i) = α·B(i) + β·C(i)

where α and β are parameters fixed by the algorithm and A, B, C are arrays. So they are made independently cell by cell, and are therefore intrinsically parallel.
Once all these preliminary operations are made, the linear system can be inverted. First the system is preconditioned, if requested: the subroutine fspsolve is called. In figure 16 the first logical block joined to fspsolve is dashed because not all the preconditioners need the setting of boundary conditions.

As shown in equation 7, the dimension of the columns of the matrix A is not the same as that of the vector x. In order to resolve this difficulty the vector x is copied into the area of work memory dedicated to the corresponding dependent variable field; for example, if the CGS solver has been called to advance the velocity, the vector x is copied into the work area localized by the pointer idepvp, starting from idepvp + NCELL and for the next 3·NCELL locations, that is where the velocity field has its permanent storage (figure 7). The resulting vector has a number of points equal to the points in the columns of A, the dependent variables being defined both in the internal and in the ghost cells. Inside the subroutine fspsolve these points are reached by means of a local pointer named icopy. It is simply:

      icopy = idepvp(ibloc) + ifld·ncell(ibloc)

where ifld is the parameter that indicates the field considered; in the example its value is 1. The matrix operations are made with this array work(icopy), and at the end the updated values are copied back into x. The ghost cells are not used in the diagonal and ILU kinds of preconditioning.
In the case of diagonal and ILU preconditioning the structure is explained in figure 17.

[Figure 17: fspsolve structure in the case of diagonal or ILU preconditioning. Loop over blocks: copy x into the array work(icopy) (copyx2f); loop over blocks: conditioning of the vector work(icopy) (pdiag in the diagonal case, pilu in the ILU case); loop over blocks: copy the updated work(icopy) back into x (copyf2x).]
Some columns of the matrix A are stored in the allocated arrays indicated by idprec in the diagonal case, and by ilprec and iuprec in the ILU case. The vector x is multiplied and divided by linear combinations of them.

In the case of SOR preconditioning the operations are carried out following figure 18. The structure is very similar, but in this case the preconditioner makes use also of the ghost cells to condition the internal values; so it is necessary to fill the ghost cell elements of the array work(icopy). It is important to note that only the internal cells are used for the time updating of the fields; this is why the matrix A, unlike the unknown x, includes the ghost cells in its columns.

[Figure 18: fspsolve structure in the case of SOR preconditioning. Loop over blocks: copy x into the array work(icopy) (copyx2f); B.C. setting (zerosbc); loop over blocks: conditioning of the vector work(icopy) (psor); loop over blocks: copy the updated work(icopy) back into x (copyf2x).]

The ghost cells of the vector work(icopy) are filled by setting boundary conditions as for the dependent variables, in the subroutine zerosbc; the operation was explained in paragraph 1.3 and figure 8.
As shown in figure 16, once the preconditioning is done, the algebraic linear system is solved. Matrix-vector products are necessary for this purpose. These operations are made in subroutine fsmatvec. The product is between the matrix A and the preconditioned vector x.

The subroutine fsmatvec follows a structure of the same kind as fspsolve (figure 19). The copy of the vector work(icopy) onto the vector x at the end of the subroutine is not done here, because the vector x is unchanged by this subroutine. Matrix dimensions must be compatible, so the array work(icopy) is used with the ghost cells filled in the subroutine zerosbc. The presence of the B.C. setting, as usual, implies the connection among blocks.

It must be noted that the area of permanent memory dedicated to the dependent variables (that is, the area localized by the pointer idepvp, figure 7) has been overwritten in solving the linear system. So it is essential to save the dependent variables before calling the fscg subroutine and to restore them after it, in each fs"variable" subroutine.

[Figure 19: fsmatvec structure. Loop over blocks: copy x into the array work(icopy) (copyx2f); B.C. setting (zerosbc); loop over blocks: compute A·x.]
The solver structure analysis leads to the conclusion that in the multi-block case a sort of domain decomposition is made, where the sub-domains are the blocks. It is important to note that this structure, together with the implicit time advancing method, causes differences between the multi-block and the one-block case.

When there is a block subdivision, a computational point is influenced by all the other points of the same block and by the two layers of ghost cells, which represent the first two layers of cells of the adjacent blocks (figure 8), and not by the whole domain as in a one-block configuration.
3 Analysis of parallelism

In the last section the block connections (B.C., global operations) and the block-loop structures have been underlined. Only the first must be changed for the parallelization, because the block loops are intrinsically parallel. Three alternative choices are possible to achieve the goal of the ARES parallelization, namely:

1. to parallelize only the first part of the code and not the solver;
2. as the previous point 1) but using public domain parallel libraries to solve the linear system;
3. to extend the parallelization also to the loops over the blocks inside the linear solver module.

Pros and cons of the above possibilities are briefly discussed in the following.
3.1 Approach #1

It is the simplest one, but it is also the least efficient. Substantially, the parallelization is limited to the building up of the linear system, which is then solved by a serial algorithm. Following this approach, the total number of communications needed among the processors at each time step is 2·nvar·ntxvol + nvar·ntcell·7. From previous experience on similar codes, and because of the limited number of communications among processors, it seems reasonable that the parallel part of the code would scale linearly with the number of processors. In order to get some idea of the possible gain in terms of speed-up of the computation, a profiling of the scalar version of ARES has been run. Results show that almost 50% of the run time is spent in the linear system solution, that is in the part of the code not affected by the parallelization procedure. In the ideal case of linear scalability of the parallel part and of an infinite number of processors, the maximum scalability factor would then be only 2; see the worked estimate below.
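This is the classical Amdahl's law bound: if a fraction p of the run time is parallelized on N processors, the expected speed-up is

      S(N) = 1 / ((1 − p) + p/N)

and with p ≈ 0.5 the limit for N → ∞ is S = 1/(1 − p) = 2; even with N = 8 the bound is only S = 1/(0.5 + 0.5/8) ≈ 1.78.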
3.2 Approach #2

According to this second alternative, the same steps performed at point #1 would be implemented, plus the use of public domain parallel linear solver libraries to be linked somehow to the existing code. In order to achieve such a parallelization, some steps must be performed. Public domain libraries with cgs and bi-cgstab parallel solvers have to be found. These libraries are typically built and optimized around a compact storage format, so some work would be needed for the transformation of the structure obtained in the subroutines fs"variable" (figure 12) into the structure requested by the public domain library.

Problems connected to this second choice are, in the first place, that such suitable libraries have to be identified and implemented into the code and, secondly, that at this level, with only an "a priori" analysis of the code, it is not possible to anticipate the code performances.
3.3 Approach #3

This approach consists in making parallel all the loops over the blocks which are used in the code. The same serial code structure would be substantially maintained, but the correct communications have to be ensured among the blocks. This means implementing a message passing structure that replicates the serial B.C. setting; such settings are present in subroutine fssolver, inside the SOR preconditioning (subroutine fspsolve) and in the computation of the matrix-vector products (subroutine fsmatvec). Moreover, all the operations which were described as "global" need some communications among processors. This approach should provide a gain in performance better than the first alternative and at least similar to the second one discussed. At each iteration of the linear solver the setting of boundary conditions is called once or twice (in the case of SOR preconditioning) and other global operations are needed. This has an expensive computational cost in a parallel calculation but, since fscg represents about 50% of the computation, it is expected not to be very relevant, as long as the blocks remain large enough (or the ratio surface/volume, which is a measure of communication versus computation, is small).
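To make the surface/volume argument concrete: for a cubic block of N³ cells, the data exchanged at each boundary condition setting grow with the face area, i.e. as N², while the work done on the block grows with its volume, i.e. as N³; the communication to computation ratio therefore decreases as 1/N, so doubling the block size per processor roughly halves the relative communication cost.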
4 Code Parallelization

In order to decide on the kind of approach for the code parallelization, some tests with different linear solving algorithms were made. The failure of these algorithms and the structure of the code suggested following approach #3. The code is parallelized in a block-to-processor philosophy with the help of the MPI message passing libraries. So the processors execute the work done by each block in parallel, but at each time step and at each linear iteration all the boundary conditions are to be set and all the global quantities are calculated by the use of message passing; a minimal sketch of this mapping is given below.
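A minimal sketch of the block-to-processor mapping (mype is the MPI rank defined in section 4.1; the loop body is schematic):

c     With one block per processor, the serial loop over blocks
c        do ibloc = 1, nbloc
c     degenerates into each rank working on its own block:
      ibloc = mype + 1
c     ... computation previously done inside the block loop ...
c     boundary conditions with neighbouring blocks are now exchanged
c     through MPI messages instead of local memory copies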
In this section all the important changes made in the code to achieve the parallelization will be described.
4.1 Initialization

In the main program the MPI initialization is made. It means, simply, writing the three Fortran lines:

      CALL MPI_INIT(error)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPE, error)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYPE, error)
In init the input data are read from files. Because of the block-to-processor approach every processor must perform an independent computation, and so all of them must read the same data. A number of input files equal to the number of processors is necessary to begin the calculation.

In this subroutine the routine readbc is called. In the parallel case, connections between processors are included. More information is requested than that needed by a block connection.
itype    description        bcvar(1)  bcvar(2)  bcvar(3)  bcvar(4)  bcvar(5)
1-200    Block Connection   icsi      ics1      ics2      inc1      inc2
400-600  Pe Connection      icsi      ics1      ics2      inc1      inc2

itype    description        bcvar(6)  bcvar(7)  bcvar(8)  bcvar(9)
1-200    Block Connection   -         -         -         -
400-600  Pe Connection      idsend    idrecv    N1        N2

Table 2: Boundary conditions input data.
The codification of boundary conditions for block connections and processor connections is summarized in table 2.

[Figure 20: Data exchange between two processors. Each of PE1 and PE2 sends its first two layers of internal cells and receives the corresponding data into its own ghost cells.]
Five integer indices are assigned to set the connected boundary conditions. The connected side (icsi) of the connected block is identified by the value of bcvar(1). The values of bcvar(2) to bcvar(5) identify the starting indices (ics1, ics2) and the increments (inc1, inc2) in the two directions of the side.

For a connection between two processors, the first five indices have the same meaning. In addition, four other indices are assigned. In fact, for the success of the sending and receiving operations, some tags are required which identify the messages. These tags are read from input and remain the same for the whole computation. bcvar(6) is the tag assigned to the message to be sent and bcvar(7) to the message to be received. bcvar(8) and bcvar(9) are the numbers of nodes in the two directions of the side.
In subroutine alloc all the pointers are defined. In this subroutine the number of cells of each processor is also calculated. In the parallel case it is useful to have the total number of cells in the whole domain defined as well. For this purpose a collective MPI routine is called which returns this information to all the processors. The MPI routine is called MPI_ALLREDUCE and its syntax is:

      call mpi_allreduce(input, output, count, datatype, operation, comm, error)

input       input data
output      output data
count       number of elements in the send buffer
datatype    datatype of each send buffer element
operation   kind of operation to compute
comm        communicator
error       error number

In this case the input data are the ncell values of each block, the output is their sum and the operation is mpi_sum; see the example below.
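As a concrete sketch of this call (the variable names ncell_loc and ncell_tot are illustrative, not the actual Ares ones):

c     Total number of cells over all processors (illustrative sketch)
      include 'mpif.h'
      integer ncell_loc, ncell_tot, ierr
c     ncell_loc holds the number of cells of the local block
      call mpi_allreduce(ncell_loc, ncell_tot, 1, MPI_INTEGER,
     &                   MPI_SUM, MPI_COMM_WORLD, ierr)
c     every processor now holds ncell_tot, the global cell count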
After the mesh is read, the layer of dummy coordinates is set up. This is done in subroutine dumcoo.f. In the parallel case a data exchange between neighbouring processors is needed. It is made as in a normal block connection, but the data exchange is done by using message passing, as in figure 20.

The operation is divided in two parts.

First, each processor prepares in a buffer the array with the information to be sent, as shown in figure 21. The subroutine that is called for sending the message is MPI_IBSEND and its syntax is:

      call mpi_ibsend(buf, count, datatype, dest, tag, comm, request, error)
Second, the processors receive the messages and store the new information. The steps are explained in figure 22. The subroutine called for the receiving operation is MPI_RECV, whose syntax is:

      call mpi_recv(buf, count, datatype, source, tag, comm, status, error)

The meaning of the arguments is described in table 3.

buf        initial address of the buffer
count      number of elements in the buffer
datatype   datatype of each buffer element
dest       rank of the destination (mpi_ibsend)
source     rank of the sender (mpi_recv)
tag        message tag
comm       communicator
request    communication request handle (mpi_ibsend)
status     receive status object (mpi_recv)
error      error number

Table 3: Arguments of the sending and receiving operations
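A minimal sketch of one such exchange between two neighbouring processors (the buffer size, the tags and the rank other are illustrative; the tags would come from bcvar(6) and bcvar(7); note that a buffered send requires a user buffer attached with MPI_BUFFER_ATTACH, a detail not shown in the report):

c     Illustrative ghost-cell exchange between two ranks (not Ares code)
      include 'mpif.h'
      integer nbuf
      parameter (nbuf = 1000)
      integer ierr, req, stat(MPI_STATUS_SIZE)
      integer other, tagsnd, tagrcv
      double precision sendb(nbuf), recvb(nbuf)
      double precision attb(nbuf + MPI_BSEND_OVERHEAD)

c     attach a buffer once, as required by buffered sends
      call mpi_buffer_attach(attb, 8*nbuf + MPI_BSEND_OVERHEAD, ierr)

c     ... load sendb with the two layers of internal cells ...
      call mpi_ibsend(sendb, nbuf, MPI_DOUBLE_PRECISION, other,
     &                tagsnd, MPI_COMM_WORLD, req, ierr)
      call mpi_recv(recvb, nbuf, MPI_DOUBLE_PRECISION, other,
     &              tagrcv, MPI_COMM_WORLD, stat, ierr)
      call mpi_wait(req, stat, ierr)
c     ... store recvb into the ghost cells ...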
4.2 Solver

Boundary conditions for connections among processors are treated in the same manner as for the dummy coordinates. This operation is repeated for all the dependent variables.

As shown in figure 16, inside the linear solver the boundary conditions are set and, furthermore, some global operations are made.

[Figure 21: Sending operation. Each processor loads the data for its neighbours into send buffers.]

[Figure 22: Receiving operation. Each processor receives the messages and stores them into its ghost cells.]

The global operations, described in section 2.5, are Euclidean norms performed over the whole set of blocks. In a parallel calculation the order of these operations is not conserved numerically: because of the truncation error, in some cases a change in the order can produce big differences in the result.
To reach the same results the procedure is divided in two parts, see figure 23. First, each processor makes its own serial computation. Then all the data are put in a single array on the first processor, exactly as it would be in a serial calculation with n blocks. This operation is possible in parallel by calling an MPI collective subroutine called MPI_GATHERV. Processor #1 performs the norm, having all the data needed and in the same order as in the serial case. Of course the norm performed in this way is safe for the result, but it is very expensive in terms of computational time. In fact the norm is made in a serial way, as with a single processor, and moreover a collective MPI operation is called. This is an intrinsic limitation of this kind of algorithm.

[Figure 23: Norm in parallel. The data of pe1, pe2, ..., peN are gathered into a single array on pe1, which then computes the norm in serial order.]

For this reason a second, cheaper procedure
was implemented in order to improve performances. The norms are calculated separately by each processor and the global norm is obtained by summing the results with the help of the MPI routine MPI_REDUCE; this does not maintain the serial order. The routine MPI_REDUCE has the same syntax as the routine MPI_ALLREDUCE, plus the rank of the root processor; the only difference is that MPI_REDUCE returns the result to one processor only. From a mathematical point of view the two methods are equivalent, but not numerically. In all the test cases performed the differences appear unimportant and the results are correct, with a powerful gain in performance; a sketch of this second procedure is given below.
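A minimal sketch of this second procedure (the array name and size are illustrative; the MPI_ALLREDUCE variant is shown so that every rank gets the norm, while the report's implementation uses MPI_REDUCE to return it to one processor only):

c     Parallel Euclidean norm via partial sums (illustrative sketch)
      include 'mpif.h'
      integer i, n, ierr
      parameter (n = 1000)
      double precision rhs(n), psum, gsum, gnorm
c     rhs holds the local residuals of this rank
      psum = 0.d0
      do i = 1, n
         psum = psum + rhs(i)*rhs(i)
      end do
c     sum the partial results of all ranks; the summation order differs
c     from the serial one, so the result may differ at round-off level
      call mpi_allreduce(psum, gsum, 1, MPI_DOUBLE_PRECISION,
     &                   MPI_SUM, MPI_COMM_WORLD, ierr)
      gnorm = sqrt(gsum)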
4.3 Convergence

Finally, the history of convergence is written at each time step. The maximum residual, the rms value and the position of the maximum residual are calculated for each dependent variable. These are global quantities, so MPI collective operations are required for computing their correct values. This is done in subroutine globcon, called in fssolver. The maxima are calculated locally and then the maximum among all the processors is chosen, together with its position (including the processor). The rms value is computed by summing the values of each processor. The MPI routine called to do these operations is MPI_REDUCE. The operation computed inside the MPI routine is always mpi_sum.
5 Validation

At this stage of development, the test cases performed had to assure the correct working of the code. In particular, two boundary layer cases (a laminar and a turbulent flat plate) and a Moreau combustion chamber were computed. All the cases were computed with both the algorithms available to solve the linear system, cgs and bi-cgstab. In all the cases the results are exactly equal to those obtained with the serial simulation. Figures 24, 25 and 26 report the convergence history in the three cases: it is not possible to distinguish between the serial and the parallel simulations.

The speed-up of the code was studied in one of the test cases. The Moreau combustion chamber was analyzed varying the number of processors from 1 to 8. Table 4 shows the results of the serial code against those of the parallel one.
proc.          CPU time   Speed-up
1 (2 blocks)   438.26 s   1.0
2 (2 blocks)   230.17 s   1.91
1 (4 blocks)   482.29 s   1.0
4 (4 blocks)   189.01 s   2.56
1 (8 blocks)   535.97 s   1.0
8 (8 blocks)   135.24 s   3.96

Table 4: Performances
The first column shows the number of processors used. The CPU time grows with the number of blocks due to the increased communication work inside the linear system algorithm; this is the reason why a number of 1-processor runs have been made with different numbers of blocks. Then CPU times and speed-up factors are shown. From the results it is possible to see that the performance of the code, as expected, presents some scalability, but the speed-up is far from the ideal linear one. The performance of the parallel implementation will be assessed in more detail in the coming report on the overall validation of the code.
[Figure 24: Convergence in the laminar flat plate. u-residual (log10) vs. iteration; the single pe and parallel curves coincide.]

[Figure 25: Convergence in the turbulent flat plate. u-residual (log10) vs. iteration; the single pe and parallel curves coincide.]

[Figure 26: Convergence in the Moreau combustion chamber. u-residual (log10) vs. iteration; the single pe and parallel curves coincide.]
A Input and Output files

Much information about the kind of simulation the code has to compute is written in input files. The input and output files follow the form

      prefix.ext

where prefix is the name of the test case. In the subroutine init these files are opened and read, the prefix being taken from the file PREFIX; ext corresponds to the following extensions:

      dat, msh, ini, con, out, tec, log.

In the case of a reacting calculation, the files prefix.pre and prefix.dif can also be read. One or more files prefix.jetn should be present if external velocity or turbulence profiles are used.
A.1 The dat file

Input data are read from the dat file. Here is all the information needed to choose the kind of simulation (choice of algorithms, turbulence and combustion models, geometrical information etc.). Once the dat file has been read, if requested by the chosen combustion model, other input files are read with the right values of the parameters used in the model.
A.2 The bc file

The file with suffix bc contains all the information needed for a correct treatment of the boundary conditions. The dimensions of the grid blocks are read. The indices for all sides and segments are assigned. The kind of conditions and the number of cells for all the boundaries are read and stored in integer arrays. At this level the profiles of velocity and of the turbulence variables are read, if they are assigned.

Many kinds of boundary conditions are allowed. There are definitions for different adiabatic walls, inlet and outlet profiles. There are all the parameters needed in block connections. The multi-block structure requires an exchange of data between neighbouring blocks to assign the correct value of the residuals on the boundary. The data exchange is done in only one step. The memory locations corresponding to the two layers of ghost cells are filled with the values of the corresponding inner-field dependent variables. These represent all the information needed to compute the fluxes through connected boundaries.

Using the information contained in dat and bc, the work data structure is set up.
A.3 The msh file

The mesh is read from the binary file msh. To treat the boundary correctly, two layers of dummy coordinates are constructed around the boundaries of each block using the values of the coordinates and normals of the boundary cells.

This is done in the subroutine dumcoo. All volumes and normals, also for the dummy cells, are set.
A.4 The con file

Because the simulations to be performed are almost always stationary, the calculation is expected to reach convergence after a number of iterations. It is important to monitor this behaviour in order to decide when and whether the computation has succeeded. The most common choice is to follow the advancing of the right hand side of the momentum equations changed in sign, which, in a stationary calculation, goes to zero iteration after iteration. These right hand sides are the fluxes of the dependent variables, which are also called residuals. In the Ares code the residuals of all the variables used in the calculation are stored. In the first 3 iterations their absolute value is stored and written in the file .con. The third is taken as the reference residual res_0 for the following iterations. At iteration i the residuals are related to the reference value as

      log10( res_i / res_0 )

and this expression is evaluated for the sake of simplicity. The calculation is considered to have reached engineering convergence when the residual has dropped below 10^−5 of its reference value, that is when this expression is less than −5.

With the values of the normalized residuals, their location in the grid is also written: for each variable, the indices i, j, k of the cell and the block where the maximum absolute residual is found are written.
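For example, if the reference residual is res_0 = 2.0e−1 and at iteration i the residual has fallen to res_i = 1.5e−6, then log10(res_i/res_0) ≈ −5.1 and the run is considered converged.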
A.5 Restart simulations

When the simulation is a restart of an older one, the given solution is read from the ini file, which contains the solution of a previous calculation. The code must also know the convergence history of the previous calculation, hence the con file of the previous run must be present. The con file contains all the convergence information: the last iteration, the time step and the values of the maximum residuals.
A.6 The Output Files

The output files are opened at the first iteration. The values, in every cell of the computational domain, of the field variables at the end of a run session are stored in the out file.

The tec file is an unformatted file intended for successive processing of the solution data (visualization, further elaboration, etc.). In the tec file the mesh coordinates and the values of the field variables are stored.

The log file is an ascii file reporting a summary of the run characteristics.
References

[1] Mulas, M., Beeri, Z., Golby, D., Surridge, M., and Talice, M. "A parallel Navier-Stokes code for large industrial applications". In P. Kutter et al. (eds), Lecture Notes in Physics, Springer, 1997, pp. 450-455.

[2] Mulas, M., Chibbaro, S., Delussu, G., Di Piazza, I., and Talice, M. "The CFD code Karalis". CRS4-TECH-REP 00/87, 2000.

[3] Metcalf, M., and Reid, J. FORTRAN 90/95 Explained. Oxford University Press, 1996.

[4] XL FORTRAN for AIX, Language Reference. IBM, 1994.

[5] XL FORTRAN for AIX, User's Guide. IBM, 1994.

[6] Appunti scuola di parallelismo del CINECA (notes from the CINECA school on parallel computing). CINECA, 1998.

[7] MPI: A Message-Passing Interface Standard. 1995.