
POLITECNICO DI MILANO

DIPARTIMENTO DI SCIENZE E TECNOLOGIE AEROSPAZIALI (DAER)

DOCTORAL PROGRAMME IN AEROSPACE ENGINEERING

A UNIFIED GPU-CPU AEROELASTIC COMPRESSIBLE URANS SOLVER FOR AERONAUTICAL, TURBOMACHINERY AND OPEN ROTORS APPLICATIONS

Doctoral Dissertation of:

Andrea Gadda

Supervisor:

Prof. Paolo Mantegazza

Co-Supervisor:

Dr. Giulio Romanelli

Tutor:

Prof. Alberto Matteo Attilio Guardone

The Chair of the Doctoral Program:

Prof. Luigi Vigevano


I would like to thank my supervisor, Prof. Paolo Mantegazza, for this opportunity. Thanks to him I could combine my passion for computer science with my work.

I would like to thank Dr. Giulio Romanelli for his help and support during these 3 years of PhD. I would like to thank him for the collaboration on the development of AeroX and for his advice.

I would like to thank Prof. Luca Mangani and Prof. Ernesto Casartelli from HSLU for their support and collaboration concerning the development of AeroX, especially regarding turbomachinery. I would like to thank them for their hospitality during my period abroad at HSLU.

I would like to thank Dr. Andrea Parrinello and Dr. Davide Prederi for their help and the collaboration concerning open rotors.

I would like to thank Prof. Marco Morandini for his help during these years at Politecnico, especially concerning programming and computer science.

I would like to thank all the friends I met during these years at Politecnico di Milano, especially Marco, Michela, Fonte, Sara, Desa, Teo, Luca, Mattia, Zaga and all the new crazy friends at the office: Pietro, Davide, Simone, Mattia, Farooq, Aureliano, Malik, Paolo, Mirco, Alessia and all the others.

I would like to thank all my friends from the TAV (TAVerna) for all these years since high school. Thank you Branza, Teo, Cislo, Giulia, Baro, Pianta, Devix, Mirio, Carbo and all the others! So many memories!

Finally I would like to thank my parents and my sister, Alessia, for their support during these years. I would like to thank all the other members of my family for all their support.


Abstract

For the aerodynamic design of aeronautical components, Computational Fluid Dynamics (CFD) plays a fundamental role. Pure CFD analyses are usually sufficiently accurate for a wide range of problems. However, when the deformability of the structure cannot be neglected or rigidly moving parts appear in the fluid domain, different disciplines (such as Fluid-Structure Interaction), methodologies (such as the Finite Element Method) and strategies (such as Multibody System Dynamics) are also required.

Besides the usual aeronautical examples where an accurate study of the interaction between the fluid and the structure is a key part of the design process (e.g. wings, aircraft, helicopter blades), another important field is represented by turbomachinery, where aeroelastic investigations are not yet widely performed in the literature. A recent research trend is also represented by open rotors and propfans.

Together with the availability of more and more powerful computing resources, current trends pursue the adoption of such high-fidelity tools and state-of-the-art technology even in the preliminary design phases. Within such a framework, Graphical Processing Units (GPUs) yield further growth potential, allowing a significant reduction of CFD process turn-around times at relatively low costs.

The target of the present work is to illustrate the design and implementation of an explicit density-based URANS coupled aeroelastic solver, called AeroX, for the efficient and accurate numerical simulation of multi-dimensional time-dependent compressible fluid flows on polyhedral unstructured meshes. Turbomachinery and open rotor extensions are also implemented to handle complex compressor, turbine and propfan cases. The solver has been developed within the object-oriented OpenFOAM framework, using OpenCL for GPGPU programming and CPU-GPU interfacing. Different convergence acceleration techniques, such as Multi Grid and Local Time Stepping, are implemented and suitably tuned for GPU execution in order to allow an implicit-like residual convergence. Dual Time Stepping is also implemented to allow time-accurate simulations of unsteady cases of aeronautical interest, such as wing and blade flutter. As far as aeroelasticity is concerned, Radial Basis Functions are employed to interface the aerodynamic and the structural meshes. The modal representation of the structural behavior is adopted thanks to its accuracy and computational efficiency. Inverse Distance Weighting is used to update the aerodynamic mesh points from the known wall displacements. The solver is specifically designed to exploit cheap gaming GPU architectures, which exhibit high single precision computational power but a limited amount of global memory.


Equations are solved in non-dimensional form to reduce numerical errors. The solver is also natively compatible with more expensive HPC GPUs, allowing the exploitation of their high double precision computational power and their larger amount of memory. Thanks to OpenCL, AeroX is also natively compatible with multi-threaded CPU execution.

The credibility of the proposed CFD solver is assessed by tackling a number of aeronautical, turbomachinery and open rotor benchmark test problems, including the 2nd Drag Prediction Workshop, the 2nd Aeroelastic Prediction Workshop (AePW2), the HiReNASD wing, the AGARD 445 wing, NASA's Rotor 67 blade, the 2D/3D Standard Configuration 10 blades, the Aachen turbine and the SR-5 propfan blade. The recent AePW2 benchmark case, in particular, proves that AeroX is capable of predicting flutter with an accuracy level comparable to that of state-of-the-art compressible aeroelastic URANS solvers, requiring just a cheap gaming GPU. In the literature it is difficult to find static aeroelastic investigations of turbomachinery blades. Thus, the trim of NASA's Rotor 67 fan blade is investigated here, showing that the high blade stiffness is responsible for the very small wall displacements. This translates into negligible differences between the aeroelastic and the purely aerodynamic solutions for such configurations.

The focus of this work is also on computational aspects. With AeroX, an average speed-up factor of one order of magnitude is obtained when comparing CPUs and GPUs of the same price range.


Sommario

Computational Fluid Dynamics (CFD) plays a fundamental role in the design of aeronautical components. Purely aerodynamic analyses are usually sufficient for a wide range of problems. However, when the deformability of the structure cannot be neglected, or when rigidly moving parts are present in the fluid domain, other disciplines (such as fluid-structure interaction), methods (such as the finite element method) and strategies (such as multi-body system dynamics) become necessary.

Besides the usual aeronautical examples in which an accurate study of the interaction between the fluid and the structure is a key point of the design process (e.g. wings, complete aircraft, helicopter blades), another field is represented by turbomachinery, where static aeroelastic analyses are not yet widely performed in the literature. A recent trend is also represented by open rotors and propfans.

Together with the availability of increasingly powerful computing resources, the current idea is to adopt tools capable of providing accurate solutions already in the preliminary design phases. Within this framework, graphics cards (GPUs) allow a significant reduction of computing times at relatively low cost.

The aim of this work is to illustrate the design and implementation of an explicit, compressible, viscous (URANS) aeroelastic solver, called AeroX, suited to the efficient and accurate simulation of unsteady and multi-dimensional cases and compatible with unstructured polyhedral meshes. Turbomachinery and open rotor extensions are also implemented in the solver in order to handle compressor, turbine and propfan cases. The solver has been developed within the object-oriented OpenFOAM environment, using OpenCL for GPGPU programming and for the CPU-GPU interface. Several convergence acceleration techniques, such as Multi Grid and Local Time Stepping, are implemented and optimized for GPU execution in order to obtain convergence histories similar to those of an implicit solver. Moreover, Dual Time Stepping is implemented in order to exploit these techniques also in unsteady cases of aeronautical interest, such as wing and blade flutter. Concerning aeroelasticity, Radial Basis Functions are used to interface the structural and aerodynamic meshes. The modal representation of the structural behavior is adopted for its accuracy and computational efficiency. Inverse Distance Weighting is used to update the positions of the aerodynamic mesh points on the basis of the wall displacements. The solver is designed to exploit gaming GPU architectures, which are characterized by a high single precision computational power but a limited amount of global memory.


The equations are solved in non-dimensional form to reduce numerical errors. The solver is also natively compatible with the more expensive HPC GPUs, allowing their high double precision computational power and larger amount of memory to be exploited. Thanks to OpenCL, the solver is also natively compatible with multi-threaded execution on CPUs.

The solver has been validated on several aeronautical, turbomachinery and open rotor cases, such as the 2nd Drag Prediction Workshop, the 2nd Aeroelastic Prediction Workshop (AePW2), the HiReNASD wing, the AGARD 445 wing, NASA's Rotor 67 blade, the Standard Configuration 10 blade (2D and 3D), the Aachen turbine and the SR-5 propfan blade. The recent AePW2 benchmark, in particular, proves that AeroX is capable of carrying out flutter analyses with an accuracy level comparable to that provided by state-of-the-art compressible aeroelastic URANS solvers, simply requiring the use of an inexpensive gaming GPU. In the literature it is difficult to find static aeroelastic analyses of turbomachinery blades. In this work the trim analysis of the Rotor 67 blade has therefore been carried out, showing that its high stiffness is the reason why its displacements are small and, consequently, the differences between the aeroelastic and the purely aerodynamic solutions are negligible.

In this work attention has also been paid to computational aspects. An average speed-up of one order of magnitude has been obtained when comparing CPUs and GPUs of the same price range.


Contents

1 Introduction
   1.1 Background, CFD/FSI and HPC state-of-the-art
   1.2 Overview of the thesis

2 Fluid dynamics and aeroelastic system formulations
   2.1 Aerodynamics formulations
      2.1.1 Navier–Stokes equations
      2.1.2 Euler equations
      2.1.3 ALE formulation
      2.1.4 Numerical discretization
      2.1.5 Convective fluxes
      2.1.6 Gradients computation
      2.1.7 Turbulence models
      2.1.8 Boundary conditions
      2.1.9 Wall treatment
      2.1.10 Convergence acceleration techniques
      2.1.11 Temporal discretization for unsteady simulations
      2.1.12 Aerodynamic steady analyses
      2.1.13 Aerodynamic unsteady analyses
   2.2 Aeroelasticity
      2.2.1 Aeroelastic system
      2.2.2 Aerodynamic transfer function matrix
      2.2.3 Aeroelastic system stability and flutter
      2.2.4 Aeroelastic interface
      2.2.5 Moving Boundaries
      2.2.6 Aerodynamic mesh internal nodes update
      2.2.7 Transpiration boundary conditions
      2.2.8 Trim analyses
      2.2.9 Forced oscillations analyses

3 Turbomachinery and Open Rotors extensions
   3.1 Turbomachinery and open rotors
   3.2 Aerodynamics and modelling
      3.2.1 Time linearized approach
      3.2.2 Harmonic balance
   3.3 Turbomachinery performance map
   3.4 Turbomachinery aeroelasticity
   3.5 Turbomachinery and open rotor formulations
   3.6 Moving Reference of Frame
      3.6.1 Exploiting ALE formulation
      3.6.2 Source terms
      3.6.3 Few considerations
   3.7 Cyclic boundary conditions
   3.8 IBPA and time-delayed boundary conditions
   3.9 Total pressure and temperature inlet boundary conditions

4 GPGPU
   4.1 History of GPGPU
   4.2 CPU vs GPU architectures
      4.2.1 When using GPGPU
      4.2.2 Gaming GPUs
   4.3 Advantages and drawbacks of GPGPU
      4.3.1 Problem size
      4.3.2 Branch divergence
      4.3.3 Memory coalescing
      4.3.4 Debugging and profiling
   4.4 OpenCL
      4.4.1 OpenCL work subdivision
      4.4.2 OpenCL memory model and consistency
      4.4.3 OpenCL code example

5 GPU implementation
   5.1 Solver programming language and libraries
      5.1.1 OpenFOAM
      5.1.2 OpenCL
      5.1.3 Interfacing OpenCL and OpenFOAM
   5.2 Solver architecture details
      5.2.1 Overall scheme
      5.2.2 Convergence check
      5.2.3 Numerical tricks for single precision
      5.2.4 Debugging the device code
      5.2.5 Profiling the device code
   5.3 Algorithms and formulations implementation
      5.3.1 Local Time Stepping and computationally similar kernels
      5.3.2 Convective Fluxes for internal faces
      5.3.3 Wall treatment
      5.3.5 Viscous fluxes
      5.3.6 Residual assembly
      5.3.7 Source terms
      5.3.8 Convergence acceleration techniques
      5.3.9 Solution update
      5.3.10 ALE and MRF
      5.3.11 Mesh deformation
      5.3.12 Cyclic boundary conditions
      5.3.13 Delayed periodic boundary conditions

6 Computational Benchmarks
   6.1 Hardware aspects
      6.1.1 GPUs
      6.1.2 CPUs
      6.1.3 APUs
   6.2 Benchmark cases and results
      6.2.1 Overall speed-up and multi-thread scalability
      6.2.2 Kernels speed-up
      6.2.3 Mesh dependency
      6.2.4 SP vs DP and ECC memory validations

7 Fixed wing aerodynamic applications
   7.1 Onera M6
   7.2 RAE
   7.3 2nd Drag Prediction Workshop

8 Fixed wing aeroelastic applications
   8.1 HiReNASD wing trim
      8.1.1 Structural model
      8.1.2 Aerodynamic model
      8.1.3 Trim analysis
   8.2 AGARD 445.6 wing flutter
      8.2.1 Structural model
      8.2.2 Aerodynamic model
      8.2.3 Trim analysis
      8.2.4 Flutter Analysis
   8.3 2nd Aeroelastic Prediction Workshop wing flutter
      8.3.1 Structural model
      8.3.2 Aerodynamic model
      8.3.3 Trim results
      8.3.4 Flutter results

9 Turbomachinery and open rotor blades aerodynamic applications
   9.1 Goldman turbine blade
   9.2 Aachen 1.5 stages axial turbine

10 Turbomachinery and open rotor blades aeroelastic applications
   10.1 SC10 2D aerodynamic damping
      10.1.1 Steady results
      10.1.2 Aerodynamic damping results
   10.2 SC10 3D aerodynamic damping
      10.2.1 Steady results
      10.2.2 Aerodynamic damping results
   10.3 NASA Rotor 67 trim
      10.3.1 Structural model
      10.3.2 Aerodynamic model
      10.3.3 Trim results
   10.4 Open rotor blade flutter
      10.4.1 Structural model
      10.4.2 Aerodynamic model
      10.4.3 Trim results
      10.4.4 Flutter results

11 Concluding Remarks

A Introduction to parallel computing
   A.1 Introduction to parallel computing
   A.2 The GPGPU way
   A.3 Flynn Taxonomy
   A.4 Parallelization strategies overview
      A.4.1 SIMD Extensions
      A.4.2 Shared memory system and multi-threading
      A.4.3 Distributed memory systems
      A.4.4 Hybrid and heterogeneous systems
   A.5 Performance aspects

Bibliography


Nomenclature

AePW Aeroelastic Prediction Workshop
ALE Arbitrary Lagrangian Eulerian
AMI Arbitrary Mesh Interface
API Application Programming Interface
APU Accelerated Processing Unit
AVX Advanced Vector Extension
BC Boundary Conditions
BDF Backward Differentiation Formula
BSCW Benchmark Super-Critical Wing
BZT Bethe–Zel'dovich–Thompson
CC Cell Centered
CC-NUMA Cache Coherent Non Uniform Memory Access
CC-UMA Cache Coherent Uniform Memory Access
CFD Computational Fluid Dynamics
CFL Courant–Friedrichs–Lewy condition
CN Crank–Nicolson
CPU Central Processing Unit
CROR Counter Rotating Open Rotor
CU Compute Unit
DDR Double Data Rate
DES Detached Eddy Simulation
DNS Direct Numerical Simulation
DP Double Precision
DPW Drag Prediction Workshop
DTS Dual Time Stepping
ECC Error Correcting Code
EE Explicit Euler
FEA Finite Element Analysis
FEM Finite Element Method
FFT Fast Fourier Transform
FLOPS Floating point Operations Per Second
FPGA Field-Programmable Gate Array
FRF Frequency Response Function
FSI Fluid Structure Interaction
FV Finite Volume
FVM Finite Volume Method
GDDR Graphics Double Data Rate
GPGPU General-Purpose computing on Graphics Processing Units
GPU Graphical Processing Unit
GUI Graphical User Interface
GVT Ground Vibration Tests
HB Harmonic Balance
HBM High Bandwidth Memory
HiReNASD High Reynolds Number AeroStructural Dynamics
HPC High Performance Computing
HR High Resolution
IBPA Inter-Blade Phase Angle
IC Initial Conditions
IDW Inverse Distance Weighting
IE Implicit Euler
LES Large Eddy Simulation
LS Least Square
LTS Local Time Stepping
MBS Multi-Body System
MG Multi-Grid
MIMD Multiple Instruction Multiple Data
MISD Multiple Instruction Single Data
ML Mixing Length
MLS Moving Least Square
MP Mixing Plane
MRF Moving Reference of Frame
NLFP Non Linear Full Potential
NS Navier–Stokes
NUMA Non Uniform Memory Access
ODE Ordinary Differential Equation
ORC Organic Rankine Cycle
OTT Oscillating TurnTable
PAPA Pitch And Plunge Apparatus
PDE Partial Differential Equation
PE Processing Element
PIG Polytropic Ideal Gas
R67 Rotor 67
RAM Random Access Memory
RANS Reynolds Average Navier–Stokes
RBF Radial Basis Function
RK Runge–Kutta
ROM Reduced Order Model
RPM Revolutions Per Minute
RS Residual Smoothing
SA Spalart–Allmaras turbulence model
SC Standard Configuration
SDK Software Development Kit
SIMD Single Instruction Multiple Data
SIMT Single Instruction Multiple Thread
SISD Single Instruction Single Data
SM Streaming Multiprocessor
SMP Symmetric Multi-Processing
SP Single Precision
SPH Smoothed Particle Hydrodynamics
SSE Streaming SIMD Extensions
SST Shear Stress Transport
SU Speed-Up
TDP Thermal Design Power
TDT Transonic Dynamics Tunnel
UMA Uniform Memory Access
URANS Unsteady Reynolds Average Navier–Stokes
USD US Dollars


CHAPTER 1

Introduction

The aim of this first chapter is to introduce the fundamental concepts that motivate this work. The reader is provided with an overview of the current state-of-the-art approaches for the numerical aeroelastic analysis of typical aeronautical cases. An important part of this work is also dedicated to the role played by GPUs in accelerating the solver computations. Thus, a brief introduction to the modern technologies adopted to accelerate numerical simulations is given. Exploiting General Purpose GPU (GPGPU) computing to reduce simulation times is in fact a very relevant trend in a wide range of numerical applications, from CFD to finance and cryptography. One of the aims of this work is to build a general purpose solver, called AeroX, that is also capable of performing turbomachinery and open rotor simulations thanks to dedicated extensions. Finally, the structure of the thesis is presented alongside a brief introduction of the most important concepts presented in each chapter. This work can also be viewed as a continuation of the work started by Romanelli and Serioli [136] and Romanelli [127] with the AeroFoam solver, pursuing the goal of obtaining fast and accurate solutions from the very beginning of the design phases of an aeronautical component.

1.1 Background, CFD/FSI and HPC state-of-the-art

Computational Fluid Dynamics (CFD) is nowadays a fundamental tool for aerodynamic design in the aeronautical field. CFD allows the simulation of almost every kind of aeronautical component, from simple airfoils to entire jet fighters.

In CFD, as in every other kind of simulation, three fundamental aspects must be considered: the mathematical/physical modeling of reality, the discretization of the problem through numerical formulations and the computational side.


Depending on the particular case, different formulations and algorithms can be used to obtain the final solution. Obviously, as more and more physical effects are modeled, the simulation cost in terms of computational time increases. Thus, the usual approach in every engineering field is based on an iterative process. Computationally inexpensive formulations are adopted for the initial design of the aerodynamic component, when high accuracy of the results is not necessary. As the development of the particular component progresses, more accurate numerical and experimental results are required in order to verify that the performance and efficiency will reach the prescribed targets. Often, a final optimization loop is adopted in order to find the best parameters that satisfy all the project requirements.

As said, accurate formulations are computationally expensive, thus it is the engineer's job to figure out the best trade-off between accuracy of the results and simulation times. Accuracy and computational power are always related. When CFD was born, the computational power of a personal computer was orders of magnitude lower than what can be found today in a cheap smartphone. Nowadays it is basically possible to buy a true computer for $5 (the Raspberry Pi Zero [30]) with a 1 GHz ARM CPU. Obviously, the first approaches to CFD were largely restricted by the limited amount of available FLOPS (Floating Point Operations Per Second) and memory. Historically, the first adopted methods were represented by the Doublet Lattice Method (DLM) by Morino, exploiting a linearization of the aerodynamic problem under the small disturbances hypothesis. Full potential formulations were first implemented in the '70s [57, 83] and eventually in a finite volume framework [84] (1977). These methods are now relatively inexpensive and can be adequate when it can be safely hypothesized that strong non-linear effects such as shocks and separations do not appear in the flow. It must be noted that using a Non Linear Full Potential (NLFP) formulation [66, 115, 116] it is easily possible to handle cases with weak shocks, when they are not strong enough to invalidate the isentropic hypothesis. Moreover, a state-of-the-art formulation [115, 122] of the NLFP can be adopted to handle this occurrence. These methods are still used today to provide, in a matter of seconds or minutes on desktop computers [66], a general idea of the performance that can be provided by an airfoil/wing/rotor/aircraft. After the initial design decisions, usually taken with a workstation, it is then possible to perform a final refinement by running more accurate and computationally expensive simulations on more powerful cluster computers.
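
For reference, the steady non-linear full potential model recalled above can be written, in its standard isentropic form (a generic textbook sketch, not the specific NLFP formulation of [115, 122]), as
\[ \nabla \cdot \left( \rho \, \nabla \Phi \right) = 0, \qquad \frac{\rho}{\rho_\infty} = \left[ 1 + \frac{\gamma - 1}{2} M_\infty^2 \left( 1 - \frac{\lVert \nabla \Phi \rVert^2}{V_\infty^2} \right) \right]^{\frac{1}{\gamma - 1}}, \]
where the velocity potential \(\Phi\) is the single scalar unknown, which is the reason for the low memory and CPU cost mentioned above.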

Historically, as more and more accurate results were required and higher computational power became available, potential methods were surpassed by the Euler formulation in the 1980s. Again, viscous effects are neglected; however, strong compressible non-linear effects given by shock waves can be modeled. Euler methods are an order of magnitude faster with respect to more expensive compressible viscous simulations and usually provide sufficiently accurate results when it is known that strong viscous effects are not likely to occur in the particular test case under analysis [130, 131, 136]. Thus, compressible Euler formulations can be viewed as an alternative to the NLFP or panel methods for the initial design phases of an aeronautical component. It must be noted, however, that in a compressible Euler approach all 5 conservative variables (density, momentum vector and total energy) have to be dealt with, while in a one-field or two-field NLFP approach this reduces down to only one or two variables (velocity potential, or density and velocity potential), with obvious advantages in terms of memory consumption and simulation times.
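
For reference, the compressible Euler equations mentioned above, written for the five conservative variables in standard differential form:
\[ \frac{\partial}{\partial t} \begin{pmatrix} \rho \\ \rho \mathbf{u} \\ \rho E \end{pmatrix} + \nabla \cdot \begin{pmatrix} \rho \mathbf{u} \\ \rho \mathbf{u} \otimes \mathbf{u} + p\,\mathbf{I} \\ (\rho E + p)\,\mathbf{u} \end{pmatrix} = \mathbf{0}, \]
with density \(\rho\), velocity \(\mathbf{u}\), specific total energy \(E\) and pressure \(p\) closed by the equation of state (for a polytropic ideal gas, \(p = (\gamma - 1)\,\rho\,(E - \lVert\mathbf{u}\rVert^2/2)\)).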


The next logical step, following both the evolution of computers and the typical engineering workflow for the design of an aeronautical component, is the introduction of a way to model viscous effects [120]. Directly solving the full compressible Navier–Stokes (NS) equations, in what is called Direct Numerical Simulation (DNS), is correct from a mathematical and physical point of view. However, the computational cost of DNS is still prohibitive today for a wide range of typical aeronautical cases of interest with high Reynolds numbers. The problem is not strictly related to the computational cost of solving the NS equations inside a single discretized entity of the continuum. The main drawback arises from the necessity of discretizing the fluid domain down to the smallest scales of turbulence [120]. Nowadays, with the available computational power, DNS simulations are limited to relatively low Reynolds numbers and peculiar cases. Of course the situation is likely to change in the future thanks to research on both the mathematical/numerical and computational sides. Currently two main alternatives to DNS are available when the continuum hypothesis is considered and an Eulerian or ALE (Arbitrary Lagrangian Eulerian) space formulation is adopted (thus excluding approaches like Boltzmann/Lattice-Boltzmann/SPH): (U)RANS and LES (and their combinations like DES/DDES). The idea behind (U)RANS, (Unsteady) Reynolds Average Navier–Stokes, resides in the "modelization" of the small-scale turbulence effects, thus excluding the necessity of their effective "resolution". This is usually done using the Boussinesq hypothesis [120] and possibly by solving additional partial differential equations (i.e. the turbulence equations associated with the particular chosen model). Since (U)RANS represents a modelization of reality, different models have been developed in the last decades. Mixing Length [54], Spalart–Allmaras [140], k-ω [152], k-ε [53] and SST [107, 108] are just a few examples in a very rich literature. Some models were developed and suitably tuned for specific problems. As an example, k-ω performs well in near-wall regions, k-ε instead performs well far from the wall and in free shear layers, and Spalart–Allmaras is specifically designed for aeronautical cases without boundary layer separations (wings and airfoils in normal conditions). Mixing Length models (e.g. Smagorinsky [139] and Baldwin-Lomax [43]) are usually less accurate and more dissipative than one- or two-equation models, requiring damping functions like Van Driest [148] for the near-wall regions. However, the main advantage of ML models is their relatively low computational requirements, since no additional PDEs are required to compute the turbulent viscosity. Some models are meant to be general purpose, providing quite accurate results on a wide range of different cases. As an example, k-ω SST was developed in order to exploit the advantages offered by both the k-ω and k-ε models, allowing both near-wall and far-from-the-wall flows to be simulated. The literature offers many papers describing corrections and optimizations of existing (U)RANS models in order to better describe particular cases (e.g. in [1] it is possible to see 7 variants just for the SST model). Sometimes (U)RANS models can be suitably tuned with experimental results. Another important aspect is represented by the so-called wall functions. In fact, (U)RANS models usually require a fluid domain discretization down to the viscous sublayer (e.g. k-ω, k-ω SST and SA), where the non-dimensional wall distance y+ is of the order of 1. This, of course, directly translates into a mesh refinement near the wall regions which, together with the cost of the discretization of the viscous terms and the solution of the turbulence equations, is the main reason for the greater computational cost with respect to an Euler simulation.


Wall functions and automatic wall treatment [88, 121] have been developed to address this problem by lowering the near-wall discretization requirements. In this way it is possible to perform (U)RANS simulations using a mesh where the wall distance of the first cell is such that y+ > 30, while still obtaining accurate viscous effects. Automatic wall treatments can also be used to automatically switch wall functions on and off depending on the value of y+. This is done by using near-wall formulations in regions where the near-wall discretization allows the viscous sublayer to be resolved, and log-law formulations in regions where the near-wall discretization is coarser. Another advantage of this approach is that it avoids an iterative procedure of building meshes until one is found that matches, over the whole discretized wall surface, the y+ values for which the adopted turbulence model is supposed to produce accurate results. It must be noted, however, that turbulence is a strictly unsteady and 3D phenomenon. Thus, using (U)RANS for steady and/or 2D cases implies further approximations. Furthermore, (U)RANS represents a cheap modelization of turbulence effects, so (U)RANS models have an intrinsically limited range of applicability. In particular, complex phenomena like separations, boundary layer-shock interactions and transition from laminar flow still represent a challenge for (U)RANS simulations. More complex models have also been developed (e.g. RSM, Reynolds Stress Models) in order to improve the accuracy of (U)RANS simulations. In RSM models the idea is to discard the Boussinesq hypothesis and directly model each component of the Reynolds stress tensor. Of course this translates into more expensive simulations, due to the necessity of solving more turbulence equations than with the Spalart–Allmaras and SST models. Besides the fact that different turbulence models may perform well in some peculiar cases and poorly in other cases for which they are not suitably tuned, when approaching a new case an important aspect is also given by user experience. For aeronautical components like airfoils, wings, airplanes, turbomachinery blades and open rotor blades, models like SST and SA are usually the first choice for a good trade-off between accuracy and computational requirements [127].
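
As a compact reference for the near-wall quantities discussed above (standard definitions and log-law constants, not specific to the wall treatment implemented in AeroX):
\[ y^+ = \frac{\rho\, u_\tau\, y}{\mu}, \qquad u_\tau = \sqrt{\frac{\tau_w}{\rho}}, \qquad u^+ \simeq \begin{cases} y^+ & y^+ \lesssim 5 \quad \text{(viscous sublayer)} \\ \dfrac{1}{\kappa}\ln y^+ + B & y^+ \gtrsim 30 \quad \text{(log-law region)} \end{cases} \]
with \(\kappa \approx 0.41\) and \(B \approx 5.0\); wall functions impose the log-law branch when the first cell center lies in the log-law region.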

The future represented by compressible DNS for all aeronautical cases is still far away. However, an intermediate point between (U)RANS and DNS is already available today and is represented by Large Eddy Simulation (LES) and Detached Eddy Simulation (DES, and eventually DDES for Delayed Detached Eddy Simulation). It must be noted, however, that LES currently represents a very active research field [147]. Roughly speaking, in the LES approach the user is able to choose the particular scale that divides what is resolved (as in DNS) from what is instead modeled (as in (U)RANS) in terms of turbulence effects. Very small scales, which contribute negligibly to the final solution, are just modeled, in a (U)RANS fashion. Bigger scales of engineering interest are instead fully resolved. LES simulations are, however, at least one order of magnitude more expensive in terms of simulation time with respect to (U)RANS ones. Although LES is not yet the standard in the aeronautical field, it surely represents the near future of viscous simulations. An important advantage of LES over (U)RANS is that the LES solution converges to the DNS solution as the mesh refinement is improved. For the (U)RANS equations, instead, mesh convergence analyses should always be performed, since from a mathematical point of view there is no guarantee of converging to the DNS solution. A less expensive alternative to LES is represented by the concept of DES.


One of the main advantages provided by DES formulations is that they can usually be implemented as simple modifications to an existing (U)RANS model [108, 142]. The so-called DDES (Delayed DES) formulation has also been introduced [141]. In near-wall regions and zones where the turbulent scales are smaller than the grid dimensions the model is switched to (U)RANS mode. Where instead the turbulence scales are bigger than the grid dimensions, the model is switched to LES mode. DES strategies can provide better results than plain (U)RANS formulations in strongly unsteady simulations, with relatively lower computational requirements with respect to plain LES formulations.

LES surely represents the future, but is still years away from adoption as the default approach in both academic and industrial fields. (U)RANS and DES, suitably tuned with experimental data, nowadays provide enough accuracy for a wide range of cases and operating regimes. Thus they will still represent the default approach for viscous cases in the coming years.

Once the most suitable model is chosen, after considering both its accuracy and its cost, the next problem is represented by the numerical aspects of its solution. Of course there are problems for which the analytic solution can be easily found. However, when considering the solution of the compressible Navier–Stokes equations for arbitrary geometries, with particular initial and boundary conditions, possibly with moving walls, some sort of numerical discretization is required.

A plethora of numerical schemes, algorithms and formulations can be found in the literature to discretize the same particular problem. Let us consider the solution of the compressible URANS equations, which is the main goal of this work. URANS represents a system of partial differential equations with temporal terms, convective terms, diffusion terms and source terms. The equations contain spatial and temporal derivatives of the unknowns and require consistent initial and boundary conditions to be specified in order for the problem to be well posed. The numerical discretization of the problem, starting from its analytic representation, is mandatory in order to implement it as an algorithm that can be processed by a computer: analytic operators such as spatial and temporal derivatives have to be substituted by their numerical counterparts, expressing the problem in a form a computer can process.

Different domain discretization approaches can be used. Usually the geometry of the problem is first specified (e.g. with an STL file) and then a mesh is generated using software like gmsh, gambit, icem, blockmesh or pointwise. However, mesh-free approaches exist in which there is no particular connectivity between the numerical points on which the solution is defined. The Smoothed Particle Hydrodynamics (SPH) approach represents an example and is particularly used in Multi-Body System (MBS) analyses. The main difference between the classical Finite Volume Method (FVM) and the SPH approach is that in the former an Eulerian representation of the fluid is adopted (possibly an ALE formulation if mesh deformation is considered), while in the latter a Lagrangian representation is adopted. In CFD, FEM is usually adopted for incompressible low Reynolds simulations, while FVM is usually used for compressible and incompressible high Reynolds simulations. Both FEM and FVM formulations can be formally obtained from the representation of the (U)RANS equations in weak form. Usually a Bubnov-Galerkin approach is adopted, in which the functional space of the test functions is the same as that adopted for the solution; however, a Petrov-Galerkin approach can be used as well, allowing two different functional spaces to be used.


Within an FVM framework, a cell-centered or a node-centered formulation can be adopted. In the former case the solution is usually considered uniform over the entire numerical cell [66, 127, 136]. In the latter, a linear interpolation (similar to a FEM approach) can usually be adopted to represent the solution [115].
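
As a compact reference for the cell-centered finite volume idea recalled above (standard semi-discrete form, independent of the specific schemes adopted in this work), the balance for a cell \(i\) reads
\[ \frac{\mathrm{d}}{\mathrm{d}t}\left( \mathbf{U}_i V_i \right) + \sum_{f \in \partial V_i} \mathbf{F}_f \cdot \mathbf{S}_f = \mathbf{Q}_i V_i, \]
where \(\mathbf{U}_i\) is the cell-averaged state, \(V_i\) the cell volume, \(\mathbf{F}_f\) the numerical flux on face \(f\), \(\mathbf{S}_f\) the face area vector and \(\mathbf{Q}_i\) the source term.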

Based on the unknowns chosen to describe the problem, two main families are usually adopted in CFD: pressure-based methods [102] and density-based methods [127, 136]. In particular, in density-based approaches [124] the unknowns are represented by the density, the momentum and the specific total energy, i.e. the conservative variables. These would be the only unknowns in an Euler (inviscid) formulation. In (U)RANS formulations further variables, related to the adopted turbulence model, must be taken into account in the solution. Density-based schemes are usually well suited for subsonic, transonic and supersonic cases. However, when the Mach number approaches zero, they are usually affected by convergence problems. Different strategies, based on preconditioning [58, 96, 150], can be adopted to obtain an all-Mach formulation. In general, if a density-based formulation is chosen, different convective numerical fluxes can be adopted, such as Roe [94, 95, 136], AUSM+ [97], CUSP [145] and Jameson [85, 86]. These are different examples of upwind fluxes that guarantee stability during the convergence but have the main drawback of being only first-order accurate. This problem can be tackled through the use of flux limiters [136], which automatically switch to a second-order formulation wherever possible in the computational domain. In particular, in near-shock regions the second-order contribution is switched off to avoid oscillations, while it is fully recovered in smooth regions. Another important aspect is represented by the entropy fix (e.g. by Harten and Hyman [117]), necessary to avoid non-physical results.
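
For reference, the basic structure of a Roe-type upwind flux of the family listed above is (standard form; the limiter and entropy fix actually used in AeroX are detailed later in the thesis):
\[ \mathbf{F}_f = \frac{1}{2}\left[ \mathbf{F}(\mathbf{U}_L) + \mathbf{F}(\mathbf{U}_R) \right] - \frac{1}{2} \left| \tilde{\mathbf{A}}(\mathbf{U}_L, \mathbf{U}_R) \right| \left( \mathbf{U}_R - \mathbf{U}_L \right), \]
where \(\mathbf{U}_L\) and \(\mathbf{U}_R\) are the left and right face states (cell values for a first-order scheme, limited reconstructions for the second-order one) and \(\tilde{\mathbf{A}}\) is the Roe-averaged flux Jacobian, whose smallest eigenvalues are modified by the entropy fix.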

Concerning viscous fluxes, different numerical schemes with different costs and accuracy levels can be adopted to compute the required gradients. One of the simplest and most robust schemes is represented by the Gauss formulation, well suited for cell-centered approaches. In this case the cell gradient is assembled by adding up the contributions of the faces, which are computed directly from the cell and its neighbor cells. Since no upwind-like concepts are required for the viscous terms, a simple and cheap weighted average between the cell values is enough. Another possible formulation is represented by the Least Squares (LS) scheme [66], mostly used in node-centered approaches. This is less robust and more expensive but usually provides more accurate results.
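
A minimal sketch of the Green-Gauss cell gradient referred to above, assuming a simple linear (distance-weighted) face interpolation:
\[ \nabla \phi_i \approx \frac{1}{V_i} \sum_{f \in \partial V_i} \phi_f \,\mathbf{S}_f, \qquad \phi_f = w_f\,\phi_i + \left(1 - w_f\right)\phi_{N(f)}, \]
where \(N(f)\) is the neighbor cell across face \(f\) and \(w_f\) a geometric weight.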

Concerning the temporal discretization of the problem, two main families of formulations exist: explicit and implicit schemes. Explicit schemes can be easily implemented, are easily parallelizable (since the solution at the new time depends only on the values stored at the previous times) and require a small amount of memory, since matrix storage is avoided. Their main disadvantage is represented by the CFL constraint that limits the maximum allowed time step. On the other side, an implicit iteration is more costly, since it basically requires the solution of a linear system. A further complication is represented by the fact that the Navier–Stokes equations are non-linear, meaning that some sort of linearization is required to compute the solution. Of course the Newton–Raphson method can be employed to perform this task; however it basically requires multiple factorizations at each physical time step (although the same factorized matrix can be used for multiple iterations [115]).


One of the main drawbacks of implicit schemes from a computational point of view is represented by their high memory requirements, since the system matrix has to be stored (although in a sparse format). The main advantage of implicit methods is represented by the possibility of using large time steps (thus big CFL values) without severe stability problems. Segregated methods consist in a mixed approach between a fully implicit and a fully explicit scheme: some equations are solved through linear systems, as in a typical implicit scheme, while other variables, usually the turbulence equations, are updated in an explicit-like manner. These strategies allow a reduction of the size of the system that would be required in a fully implicit approach, at the cost of inferior convergence performance. A staggered approach [66] can also be implemented: as an example, the solution of some equations is computed using the values previously obtained from the solution of other equations in the same iteration. That said, considering the allowed time-step values and the cost of each iteration, explicit methods are usually preferred when small time steps are required, e.g. when studying acoustics. Implicit schemes are instead preferred when large time steps are required, e.g. to reach a steady-state solution. Different strategies can be adopted in order to speed up the convergence of explicit schemes, bypassing the CFL limit without actually violating it. The most important ones are represented by Local Time Stepping (LTS), Residual Smoothing (RS) and Multi-Grid (MG) [48]. These schemes can be combined together to damp the residuals and achieve convergence rates comparable to those provided by implicit formulations. Such convergence acceleration techniques speed up the convergence of explicit methods towards a steady solution with null residuals, but cannot be directly adopted in unsteady simulations. This is due to the fact that the time, at this point a "pseudo time", loses its physical meaning. Dual Time Stepping (DTS) [48] can be adopted to perform unsteady simulations with explicit methods while keeping the convergence acceleration active. The idea behind DTS is basically to converge from one physical time to the next by solving a steady problem with source terms representing the physical temporal derivatives. This way all the CFL restrictions are related to the pseudo time, handled with LTS, while the physical time step can be chosen independently. With DTS it is possible to employ physical time steps that are not limited by CFL restrictions, allowing only the frequencies of interest to be reconstructed and reducing the computational effort with respect to a global time stepping strategy.
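
A minimal sketch of the Dual Time Stepping idea described above, assuming as an example a second-order backward difference for the physical time derivative (the actual scheme adopted in AeroX is given in Chapter 2):
\[ \frac{\partial \mathbf{U}}{\partial \tau} + \underbrace{\frac{3\mathbf{U}^{n+1} - 4\mathbf{U}^{n} + \mathbf{U}^{n-1}}{2\,\Delta t} + \mathbf{R}\!\left(\mathbf{U}^{n+1}\right)}_{\mathbf{R}^{*}\left(\mathbf{U}^{n+1}\right)} = 0, \]
where \(\tau\) is the pseudo time, marched with LTS/MG until the modified residual \(\mathbf{R}^{*}\) is driven to zero, while the physical time step \(\Delta t\) is chosen only from the frequencies of interest.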

The DTS technique can be used to solve the fully non-linear URANS equations in the time domain. This can be profitably used when complex unsteady non-linear phenomena are investigated on a generic computational domain. A subclass of aeronautical problems is represented by turbomachinery and, recently, by the renewed interest in open rotors/propfans, which are usually characterized by time and spatial periodicity. Thus, in this class of problems, when the hypothesis of spatial and/or time periodicity is valid, simplifications can be adopted to reduce the computational cost of the simulations. In particular, besides the non-linear time domain schemes, time-linearized [63] and Harmonic Balance (HB) techniques [63] can also be adopted. Time-linearized techniques allow a drastic reduction of the computational effort with respect to time domain strategies, but cannot be adopted when strong non-linear phenomena such as separations and shocks occur in the flow. The Harmonic Balance technique, however, can be adopted to provide a fully non-linear frequency domain formulation. Generally, time-linearized and HB techniques reduce the total computational cost when a single particular frequency has to be analyzed, since with a non-linear time domain approach an entire unsteady solution would be required.


However, as will be presented in this work, by exploiting a suitably crafted aeroelastic system input it is possible to excite a wide range of frequencies in a single unsteady analysis. This supports the choice of a non-linear time-domain solver, especially for flutter analyses.
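
As a reference for the Harmonic Balance idea mentioned above (a generic truncated Fourier representation, not the specific formulation adopted in this thesis):
\[ \mathbf{U}(\mathbf{x}, t) \approx \hat{\mathbf{U}}_0(\mathbf{x}) + \sum_{k=1}^{N_H} \left[ \hat{\mathbf{U}}_k^{c}(\mathbf{x}) \cos(k\omega t) + \hat{\mathbf{U}}_k^{s}(\mathbf{x}) \sin(k\omega t) \right]. \]
Substituting this expansion into the governing equations and balancing the harmonics turns the unsteady problem into \(2N_H + 1\) coupled steady-like problems for the Fourier coefficients, solved simultaneously instead of marching the full solution in time.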

Alongside CFD simulations that involve purely aerodynamic effects, a primary research field is represented by aeroelasticity. Reality is intrinsically aeroelastic: there is no non-deformable structure. Some structures are such that their displacements under aerodynamic loads are negligible from an engineering point of view. However, there are important cases, like aircraft wings, where accounting for the structural response provides more accurate results [115, 127]. This is more and more relevant as the adoption of innovative materials (i.e. composites [39]) and technologies (i.e. 3D printing [149]) advances. This is particularly true for static aeroelasticity. The most important aeroelastic phenomenon in the aeronautical world is represented by flutter. This is a dynamic instability of the aeroelastic system, when the structural and aerodynamic subsystems are considered coupled. Flutter must be investigated as it is directly related to self-sustained vibrations, thus to fatigue life, thus to safety. At the beginning of the first century of flight, aeroelasticity was ignored during the design process. However, catastrophic failures occurred and it therefore became evident that aeroelastic analyses needed to be part of the safety-check procedures during the design of new components. Nowadays the trend is to introduce aeroelastic analyses from the very beginning of the design process, leading to the requirement of high efficiency solvers capable of fully exploiting state-of-the-art computational hardware. However, aeroelasticity is still an open problem today. This is confirmed by the fact that the most recent (2015/2016) effort to assess the accuracy of state-of-the-art aeroelastic solvers is represented by NASA's Aeroelastic Prediction Workshop 2 (AePW2) [81], with the purpose of comparing results provided by different research groups from all over the world. In the past the AePW1 [80] was also adopted to pursue this goal with the HiReNASD and BSCW wings. The AGARD 445.6 wing flutter investigation [156] is another well-known benchmark case. Benchmark cases for turbomachinery aeroelastic investigations also exist, like the SC (Standard Configuration) cases [69], e.g. SC10. While in turbomachinery aeroelasticity the aerodynamic damping analysis seems to be the most important kind of investigation, in the literature it is not common to find static aeroelastic analyses. This is probably due to the higher stiffness of blade configurations with respect to more classical aeronautical wings. Besides turbomachinery, open rotors/propfans represent a recent trend in the aeronautical field. The first studies of this kind of configuration date back to 1975 at NASA. At their very beginning, CROR (Counter Rotating Open Rotor) [78] configurations were affected by high noise levels. The current need for new high efficiency solutions for aeronautical propulsion indicated open rotors as a possible candidate, renewing the interest in this kind of configuration [122]. Figure 1.1 shows examples of pushing and pulling CROR configurations. As for turbomachinery and helicopter blades, open rotors represent a challenge from the numerical point of view. In fact, alongside the need to account for compressibility and viscous effects from the purely CFD point of view, the structural deformability due to both aerodynamic and centrifugal effects should be taken into account. This clearly suggests that for this kind of rotor what is often called a multi-physics approach is required.


Figure 1.1: Examples of CROR puller (on the left) and pusher (on the right) configurations [122], www.redstar.org, www.gfdiscovery.blogspot.it.

Figure 1.2 shows the Collar's triangle [47], highlighting the connections between the different subsystems [127]. This triangle can also be modified by adding the control subsystem [127], which is particularly important for several applications, e.g. gust alleviation [68]. These interactions can also trigger non-linearities [127].

Figure 1.2: Collar’s triangle, interaction between subsystems.

Accurate aeroelastic simulations require the CFD formulation to be coupled with algorithms that allow mesh deformation. In particular, Radial Basis Function (RBF) interpolation schemes can be adopted to build the so-called aeroelastic interface between the usually different structural and aerodynamic wall mesh discretizations [52, 123]. Another option is represented by the Inverse Distance Weighting (IDW) algorithm [137, 154], which can also be used to update the positions of the internal aerodynamic mesh nodes from the known wall displacements.
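
A minimal sketch of the two interpolation tools mentioned above (generic textbook forms; the basis functions and weights actually used in AeroX are given in Chapter 2). The RBF interpolant of the structural displacement field and the IDW update of an internal aerodynamic node can be written as
\[ \mathbf{u}(\mathbf{x}) = \sum_{j=1}^{N_s} \boldsymbol{\gamma}_j\, \varphi\!\left( \lVert \mathbf{x} - \mathbf{x}_j \rVert \right) + \mathbf{p}(\mathbf{x}), \qquad \mathbf{u}_{\mathrm{node}} = \frac{\sum_{k} w_k\, \mathbf{u}_{w,k}}{\sum_{k} w_k}, \quad w_k = \frac{1}{\lVert \mathbf{x}_{\mathrm{node}} - \mathbf{x}_{w,k} \rVert^{\,q}}, \]
where the \(\mathbf{x}_j\) are the structural nodes, \(\varphi\) a chosen radial basis function, \(\mathbf{p}\) an optional low-order polynomial, the coefficients \(\boldsymbol{\gamma}_j\) are obtained by imposing interpolation at the structural nodes, and the \(\mathbf{u}_{w,k}\) are the known wall displacements weighted by an inverse power \(q\) of the distance.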

The schemes and algorithms chosen for the numerical discretization of the problem are strictly related to computational aspects. As previously said, even smartphones nowadays have orders of magnitude more computational power than the computers of a few decades ago. When performing numerical simulations with a computer, at least a basic knowledge of how computers work, from both a software and a hardware point of view, is required. This is necessary in order to efficiently implement the numerical algorithms.


Roughly speaking, from the hardware point of view, when performing simulations the two most important aspects are represented by the total amount of available memory and the achievable Floating Point Operations Per Second (FLOPS). Memory is strictly related to the problem size that can be handled: more memory means bigger problem sizes. FLOPS are instead related to the speed at which the computations can be performed: more FLOPS means that more computations can be performed per unit of time. It must be noted that the computational time required to obtain the solution does not depend only on the available computational power. Fundamental aspects are also represented by the convergence properties of the implemented schemes. In fact, let us consider a scheme A that requires very little time per iteration but a lot of iterations to reach convergence, and a scheme B that is more costly in terms of time per iteration but requires far fewer iterations to reach convergence. Since the goal is to obtain the solution as soon as possible, scheme B could be the better candidate to be implemented. Another important aspect from the computational point of view is that the choice of the algorithm must also take into account the particular architecture of the machine on which it will be executed. As an example, if a quad-core CPU is available, in order to exploit the full available computational power the chosen algorithm must be capable of splitting the work into chunks that are distributed among the cores. If instead the algorithm is intrinsically serial, 3/4 of the available computational power will be wasted.

Multi-core CPUs have been available for personal computers for more than a decade. Until the first years of the 2000s, CPUs were basically single-core processors. From year to year new architectures were presented with the goal of improving serial performance, mainly by allowing higher frequencies to be reached. As an example, Intel was publicizing its Pentium 4 processors with their GHz-range frequencies. Transistor scales were reduced down to nanometers, approaching not only engineering limitations but also physical limitations. It was clear that the single-core, thus serial, performance of processors was reaching its intrinsic physical limits with the available technology. Furthermore, increasing core frequencies above 3 or 4 GHz became very difficult. Thus, in order to improve CPU performance, different strategies were adopted, first of all the multi-core concept: basically, the idea is to have multiple independent, connected computational units, sharing memory, on the same socket. Modern operating systems like Windows, OSX and Linux distributions support multi-tasking thanks to concepts like processes, threads and time slices. When the operating system kernel is executed on a multi-core CPU, it is allowed to schedule processes/threads for execution over the available cores. This way, when performing numerical simulations with the right algorithm, it is also possible to distribute the work among the available cores.

Besides the CPU, another powerful chip installed in basically every modern computer is the GPU, the Graphical Processing Unit. As the name suggests, the main purpose of this device is to accelerate computations strictly related to graphics. When performing graphical computations, the same operation has to be performed on a large number of pixels or vertexes, meaning that GPUs are intrinsically SIMD (Single Instruction Multiple Data) devices. As an example, in order to draw a triangle on the screen and then translate it, each vertex that composes the triangle basically has to be translated. This job can be performed directly by the GPU in a parallel way, by translating each vertex of the triangle, effectively offloading the CPU from the computations.


Let us now consider the sum of two vectors: the same operation has to be performed on each couple of numbers. This is basically the same operation needed to compute the new vertex positions given the initial positions and the translation vector. Since GPUs are specifically designed to perform this kind of computation, it is easy to understand why they outperform CPUs of the same price level in specific SIMD computations. The idea behind GPGPU (General Purpose GPU) is, as the name suggests, to use GPUs to perform general purpose numerical computations. Of course, since the GPU architecture is inherently SIMD, only data-parallel algorithms can truly exploit its computational power. When this is possible, and the code is suitably tuned, a one order of magnitude speed-up can potentially be achieved by using a GPU instead of a CPU of the same price level. This might seem impressive, especially considering that almost every computer nowadays has a GPU that can be exploited to offload the CPU and accelerate specific types of computations. However, it must be noted that GPGPU programming is somewhat cumbersome and exposes numerous limitations with respect to classical CPU programming. Figure 1.3(a) clearly shows the differences between the theoretical floating point computational power of CPUs and GPUs, and in particular the trend of the last years. It can be seen that AMD and NVIDIA GPUs exhibit higher theoretical FLOPS performance in single precision. The comparison between Single Precision (SP) and Double Precision (DP) performance is a fundamental aspect that will be discussed in detail in this work. Figure 1.4 shows the advantages provided by GPU acceleration in terms of performance-to-cost ratio. Besides the fact that the numbers shown in the figure are highly dependent on the particular chosen CPU/GPU combination, it is clear that adding GPUs to an HPC system contributes to a better performance-to-cost ratio. What is important to notice is not the numbers themselves, but the order of magnitude of the advantage given by using GPUs with respect to a CPU-only system. It should obviously be remembered that these are just theoretical numbers, since when tackling a numerical problem it is not always possible to exploit the intrinsically SIMD GPU architecture. The first GPGPU approaches [77] were based on mapping the numerical problem onto a graphical problem in order to exploit the graphical APIs (Application Programming Interfaces), communicate with the GPU and ask it to perform computations. Translating numerical algorithms into pixel/vertex operations was very difficult at the beginning of GPGPU. Later, in 2007, NVIDIA launched CUDA, providing an easier way to access the computational power of GPUs for generic numerical computations. ATI (nowadays AMD), the main competitor of NVIDIA in the GPU market, launched its own SDK for GPGPU programming, called ATI Stream. Nowadays different modern SDKs and languages can be adopted for GPGPU computing, such as CUDA (NVIDIA only) and OpenCL (multiple CPU, GPU and FPGA vendors). OpenCL and CUDA offer low-level GPGPU programming capabilities. Other languages, such as OpenACC, offer higher-level interfaces to GPGPU.
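
As a minimal illustration of the data-parallel style discussed above, the element-wise sum of two vectors can be written as an OpenCL C kernel in which each work-item handles one entry (a generic sketch, not taken from the AeroX sources; a complete example is given in Section 4.4.3):

    // OpenCL C kernel: one work-item per vector entry.
    __kernel void vector_add(__global const float* a,
                             __global const float* b,
                             __global float* c,
                             const unsigned int n)
    {
        // Global index of this work-item in the 1D launch range.
        const size_t i = get_global_id(0);
        // Guard needed when the global range is rounded up beyond n.
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }

The host enqueues this kernel over an n-sized (or padded) global range and the OpenCL runtime maps the work-items onto the SIMD lanes of the device; the solver kernels described in Chapter 5 follow the same execution model.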

GPGPU is exploited in very different numerical fields such as CFD, finance, cryptography, machine learning, medicine, signal and image processing, etc. Different software packages from different software houses offer the possibility of GPU acceleration. Furthermore, wrappers and libraries for numerical computations are available today that exploit GPGPU while freeing the user from low-level GPU programming; examples are ViennaCL [35] and MATLAB.
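As an illustration of how such libraries hide the low-level details, the vector sum of the previous example can be written with ViennaCL entirely in C++, without touching the OpenCL API directly. The following is a minimal sketch assuming the standard ViennaCL headers; it is not part of the solver developed in this work.

#include <vector>
#include "viennacl/vector.hpp"

int main()
{
    const std::size_t n = 1000000;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    viennacl::vector<float> va(n), vb(n), vc(n);
    viennacl::copy(a, va);   /* host-to-device transfers handled by the library */
    viennacl::copy(b, vb);

    vc = va + vb;            /* translated internally into a device kernel launch */

    viennacl::copy(vc, c);   /* result copied back to host memory */
    return 0;
}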


Figure 1.3: CPU vs. GPU performances trend [2]: (a) single precision theoretical performances; (b) single precision performances per Watt.

Figure 1.4: Performances to costs ratio, GPGPU advantages [3].

However, there are several drawbacks and limitations when programming GPUs for general numerical computations. Concepts like branch divergence and coalesced memory access must be taken into account in every algorithm to be implemented. Furthermore, a typical GPU has less memory than the amount of system memory (RAM) usually available on a workstation. This means that explicit algorithms are preferred over implicit algorithms, thanks to the avoided matrix storage and the intrinsically local memory footprint. Alongside these problems, debugging and profiling GPU code is considerably harder than for typical CPU code. Particular attention must be dedicated to memory accesses, since buffer overflows can lead to unexpected crashes, and finding the exact line of code where buffer bounds are violated can be very hard. Recently a debugger called Oclgrind [27] was developed, which essentially provides a Valgrind-like [34] debugging tool for OpenCL-based applications; another useful feature of this debugger is the possibility of checking for data races.
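As an example of the kind of consideration involved, the following sketch contrasts a coalesced, structure-of-arrays access pattern with a strided, array-of-structures one; on most GPUs the first form uses the memory bandwidth far more efficiently. This is an illustration only, with a hypothetical 5-variable cell layout, and is not code from the solver.

#define NVAR 5   /* hypothetical number of variables stored per cell */

/* Structure-of-arrays layout: consecutive work-items read consecutive floats
   (coalesced access). */
__kernel void scale_soa(__global const float *rho,  /* densities stored contiguously */
                        __global float *out,
                        const float k)
{
    const size_t i = get_global_id(0);
    out[i] = k * rho[i];           /* work-items i and i+1 touch adjacent memory */
}

/* Array-of-structures layout: the same operation generates stride-NVAR,
   non-coalesced reads and typically runs noticeably slower. */
__kernel void scale_aos(__global const float *cells, /* NVAR variables packed per cell */
                        __global float *out,
                        const float k)
{
    const size_t i = get_global_id(0);
    out[i] = k * cells[i * NVAR];  /* only the first variable of each cell is read */
}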

An important aspect in the GPGPU world is the difference between gaming GPUs (such as the NVIDIA GeForce and AMD Radeon product lines) and HPC GPUs (such as the NVIDIA Tesla and AMD FirePro product lines). Usually high-end gaming GPUs exhibit about the same single-precision (SP) computational power as HPC GPUs but only a fraction of their double-precision (DP) computational power [37, 38]. Furthermore, HPC GPUs nowadays have more than 10 GB of memory, while high-end gaming GPUs have only about 4-8 GB. Finally, HPC GPUs feature ECC (Error Correcting Code) compliant memory. One of the goals of this project is to develop a solver capable of exploiting the single-precision computational power of gaming GPUs, since they are usually one order of magnitude cheaper than HPC GPUs while offering about the same SP computational power. Of course, using SP for the solution of the Navier–Stokes equations often requires particular tuning of the code where precision loss could be a problem.
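As an example of the kind of tuning alluded to here, a classical device for limiting single-precision round-off when accumulating many small contributions (e.g. a residual norm) is compensated (Kahan) summation. The sketch below is a generic illustration and not necessarily the strategy adopted in the solver.

/* Device-side helper: compensated summation keeps a running correction term
   for the low-order bits lost at each addition. Note that aggressive fast-math
   compiler options would optimize the compensation away. */
float kahan_sum(__global const float *x, const unsigned int n)
{
    float sum = 0.0f;
    float c   = 0.0f;                /* running compensation */
    for (unsigned int i = 0u; i < n; ++i) {
        const float y = x[i] - c;
        const float t = sum + y;
        c   = (t - sum) - y;         /* recovers the part of y lost in the addition */
        sum = t;
    }
    return sum;
}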

Finally, it must be noted that GPUs offer better performance per Watt than CPUs, as can be seen in figure 1.3(b). There are therefore also important advantages in using them (when possible) from the point of view of energy consumption. This and other aspects regarding GPGPU will be discussed in detail in this work.

Usually numerical solvers run on CPUs, possibly exploiting multi-core architectures through multi-threading, or on clusters with multiple nodes using message-passing strategies. Few programs can offload part of the computations to GPUs in order to accelerate specific tasks. Here the idea is instead to use OpenCL to parallelize almost all of the solver's algorithms, in order to exploit the SP performances of modern, cheap gaming GPUs. At the same time, the goal is to obtain a solver that is natively compatible with both CPUs and GPUs, thanks to the OpenCL runtime libraries and device drivers offered by the most important CPU and GPU vendors (Intel, AMD, NVIDIA). This way, with a single source code set, it is possible to achieve compatibility with the widest range of devices.
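A minimal sketch of what such device-independent start-up can look like on the host side is shown below (illustrative only, with error handling omitted; this is not the solver's actual initialization code): the same executable first asks the OpenCL runtime for a GPU and falls back to a CPU device if none is found.

#include <CL/cl.h>

cl_device_id pick_device(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    cl_device_id device = NULL;
    cl_uint n = 0;

    clGetPlatformIDs(8, platforms, &num_platforms);
    if (num_platforms > 8)
        num_platforms = 8;   /* only the first 8 platforms are stored above */

    /* First pass: look for a GPU on any installed platform (NVIDIA, AMD, Intel, ...). */
    for (cl_uint p = 0; p < num_platforms; ++p)
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 1, &device, &n) == CL_SUCCESS && n > 0)
            return device;

    /* Second pass: fall back to a CPU device exposed by an OpenCL CPU runtime. */
    for (cl_uint p = 0; p < num_platforms; ++p)
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_CPU, 1, &device, &n) == CL_SUCCESS && n > 0)
            return device;

    return NULL;   /* no OpenCL device available */
}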

It is worth noting that, although the main target of this work is accelerating simulations through hardware/software-based techniques, other strategies such as parametric computing and reduced-order methods can be exploited to further reduce computational times and increase simulation complexity [133].

1.2 Overview of the thesis

Here an overview of the chapters is presented. First of all, the numerical formulations implemented in the solver are presented, alongside the rationale behind each choice from both a numerical and a computational point of view, considering that the main goal is a GPU-optimized solver with turbomachinery and open-rotor extensions. The main GPGPU concepts are then introduced, focusing in particular on the choice of the OpenCL API and language. Next, the software architecture of the solver is presented, with a detailed view of how GPGPU concepts translate into the parallelization of the different CFD/FSI tasks. Computational benchmarks are then used to assess the speed-up obtained by running the solver on GPUs instead of CPUs. Finally, numerical results are presented in order to validate the solver's numerical formulations. Different kinds of test cases are adopted at this stage, ranging from steady to unsteady, from aerodynamic to aeroelastic, and from classical aeronautical to turbomachinery/open-rotor cases.

Chapter 2

This chapter presents the numerical formulations implemented in the solver to solve the Navier–Stokes equations in an ALE framework, as required to handle steady and unsteady aeronautical cases with fixed and deformable meshes. As previously said, the OpenFOAM framework is adopted for the pre-processing phase, hence a cell-centered finite-volume formulation is used. The implemented convective numerical fluxes are briefly presented, alongside the strategies for high resolution, flux limiters and entropy fix; the convective flux formulations also take into account the ALE terms necessary for moving and deforming meshes. The schemes for the viscous fluxes are then briefly introduced, together with the implemented automatic wall treatment strategy. The most important boundary conditions for aeronautical cases are introduced, including transpiration boundary conditions to emulate moving-boundary effects without actually updating the wall point positions. For what concerns aeroelastic simulations, the adopted RBF-based strategy is described to handle the interface between the aerodynamic and structural meshes, while the IDW algorithm is used to update the internal point positions, as recalled below. Convergence acceleration techniques such as Local Time Stepping, Multi-Grid and Residual Smoothing are described as the strategies adopted to speed up the convergence of the explicit solver, and Dual Time Stepping is presented for unsteady simulations. The procedures for steady, unsteady, trim, flutter and forced-oscillation analyses are also described in this chapter.
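For reference, the basic IDW update is recalled here in its textbook form (the exact weighting adopted in the solver is detailed in the chapter itself): the displacement of an internal grid point $\mathbf{x}$ is a distance-weighted average of the known boundary displacements $\mathbf{d}_i$,
\[
  \mathbf{d}(\mathbf{x}) = \frac{\sum_{i=1}^{N_b} w_i(\mathbf{x})\,\mathbf{d}_i}{\sum_{i=1}^{N_b} w_i(\mathbf{x})},
  \qquad
  w_i(\mathbf{x}) = \frac{1}{\|\mathbf{x}-\mathbf{x}_i\|^{p}},
\]
where the $\mathbf{x}_i$ are the $N_b$ boundary nodes and $p > 0$ controls how quickly the influence of a boundary node decays with distance.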


Chapter 3

This chapter first provides an introduction to turbomachinery and open-rotor problems and simulations. After the introduction of the formulations for purely aeronautical cases, the chapter is dedicated to the extensions implemented in the general-purpose solver that are required to handle rotating cases for turbomachinery and open rotors. The main purpose of these formulations is to speed up the convergence of cases that exhibit spatial and temporal periodicity. MRF allows rotating domains to be simulated without actually rotating the mesh. Cyclic boundary conditions are adopted to reduce the computational domain to an N-blade sector when the spatial periodicity is assumed to involve N blades. Strictly related to this, the IBPA and delayed boundary condition concepts are presented, which are useful for unsteady cases in which adjacent blades vibrate with a particular phase angle. The mixing plane strategy is then introduced to simulate the interface between two communicating blade rows (e.g. one stator row and one rotor row); again, this is useful in conjunction with cyclic BCs in steady simulations. A comparison with other strategies for the solution of the Navier–Stokes equations specifically designed for temporally periodic cases (time-linearized and Harmonic Balance) is also briefly presented, alongside the reasons supporting the implemented approach (non-linear time marching).
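As a reminder of the concept, for a rotor with $N$ identical blades the admissible inter-blade phase angles take the classical discrete values (recalled here in their textbook form; the notation and sign conventions used in the solver are given in the chapter)
\[
  \sigma_k = \frac{2\pi k}{N}, \qquad k = 0, 1, \dots, N-1,
\]
and the delayed (phase-lagged) boundary conditions enforce that the flow on one periodic boundary of the single-passage domain reproduces the flow on the other boundary shifted in time by $\Delta t = \sigma_k/\omega$, with $\omega$ the blade vibration frequency.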

Chapter 4

This chapter aims to introduce the reader to GPGPU concepts. The most important advantages and limitations of using GPUs to perform numerical computations are presented, and the concepts and problems related to branch divergence and sequential memory access are explained. The attention is focused on OpenCL, since it is chosen in this work as the API and language for GPU programming, and its most important abstractions (e.g. devices, platforms, kernels) are explained. Examples are provided to better explain when GPUs can be used to accelerate computations; this helps in understanding the programming choices adopted in the solver. The chapter is not meant to be an OpenCL or GPGPU tutorial, but it can nevertheless help in understanding whether a particular algorithm can be efficiently ported to GPU architectures.

Chapter 5

After the introduction of the main concepts related to both the numerical and the computational sides of the problem, this chapter explains the computational aspects of the solver architecture, how the different subsystems communicate, and how the solver algorithms are implemented and tuned. The solver uses the C, C++ and OpenCL languages. In particular, two different source sets have to be created: one for the "host", which enqueues work, and one for the "device", which actually performs the aeroelastic computations. In this chapter, pieces of host and device code are shown to provide a better feel of what really runs under the hood. Furthermore, the chapter explains the possible bottlenecks arising with hybrid/unstructured meshes and why some algorithms are instead very efficient when run on a typical GPU architecture. The connection between the solver, AeroX, and the OpenFOAM framework is also shown. Finally, strategies to improve numerical robustness and convergence are presented.


Chapter 6

This chapter presents the results concerning the purely computational side of the problem. The main purpose of using OpenCL is to obtain a solver that can exploit GPU acceleration with devices provided by the main GPU vendors, while retaining CPU multi-thread compatibility. In particular, an important advantage provided by OpenCL is that there is no need to write different codes for different architectures: every CPU and GPU that is compatible with OpenCL can be immediately used by the solver (provided that the correct runtime is installed). This translates into compatibility with a wide range of devices and, for the purpose of this chapter, into an easy way to compare the simulation times of CPU and GPU executions. The speed-up obtained with different CPU and GPU architectures, provided by different vendors such as AMD, NVIDIA and Intel, is shown. As said, AeroX is compatible with both CPUs and GPUs but is tuned for GPU execution, so it is expected to perform better on this kind of device. An APU (Accelerated Processing Unit) from AMD is also used for these benchmarks. The speed-ups are also analyzed for isolated kernels, in order to investigate accurately the computational efficiency of the different implemented algorithms. One of the main goals of this work is to exploit cheap gaming GPUs instead of more expensive, specifically designed HPC GPUs; the solver is therefore tested to check for possible problems due to the use of single precision instead of double precision and for possible differences due to the lack of ECC memory in gaming GPUs. This is done in order to address the most common criticisms of the gaming-GPU choice. Obviously, the solver is also natively compatible with HPC GPUs, double precision and ECC memory.

Chapter 7

In this chapter the solver is validated for the purely aerodynamic formulations. Different test cases and benchmarks are used to show the capability of the solver to provide accurate inviscid and viscous solutions in a reasonable amount of time, considering that it can be executed on a relatively cheap desktop computer instead of computer clusters and high-end workstations. Typical aeronautical geometries are adopted for the validation, such as the RAE and Onera M6 wings and the 2nd Drag Prediction Workshop configuration. These cases are used to show the capability of the solver to accurately predict viscous and compressible effects over different ranges of Reynolds and Mach numbers. They are investigated before the rotating (turbomachinery and open rotors) cases and the unsteady cases with mesh deformation, in order to validate the solver for what concerns the purely aerodynamic formulations for aeronautical cases.

Chapter 8

After the analysis of the purely aerodynamic aeronautical cases, in this chapter the solver is validated for what concerns static and dynamic aeroelasticity. Static aeroelasticity is assessed using a steady test case with a deformable structure; for this purpose the well-known HiReNASD wing is adopted for the validation. For what concerns unsteady cases with moving walls, the solver capabilities are investigated through forced-oscillation and flutter analyses. Flutter investigations are performed with the AGARD
