
Estimating velocity macro-models using stochastic full-waveform inversion

Academic year: 2021


University of Pisa

Doctoral Thesis

Estimating velocity macro-models using

stochastic full-waveform inversion

Author:

Angelo Sajeva

Supervisor:

Prof. Alfredo Mazzotti

A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

in the

Department of Earth Sciences

(2)

During my Ph.D. program, I have investigated two different topics. The first topic (major topic) addresses the issue of determining a suitable starting model for Full-Waveform Inversion (FWI); this topic has occupied the majority of my Ph.D. work. The second topic concerns determining a method to remove peg-legs from marine data sets acquired with a towed dual-sensor streamer. I worked on this subject during the first year of my Ph.D. In the following paragraphs, I outline my work on both of these subjects.

Estimating velocity macro-models using stochastic full-waveform inversion

I have developed a procedure that estimates an acoustic 2D macro-model of the subsurface using a genetic algorithm and that uses the information of the entire seismic wavefield in the objective function. The aim of this work is to demonstrate that such an estimated 2D macro-model is well suited to act as the starting model for high-resolution gradient-based full-waveform inversion. High-resolution gradient-based waveform inversion (usually referred to simply as full-waveform inversion, FWI) is an iterative local optimization method that exploits all the information in the seismogram to produce a high-resolution image of the subsurface. In the last twenty years, FWI has received increasing interest from both the oil and gas exploration industry and academia. In spite of being a very promising method, FWI is limited by its local nature, i.e., it terminates in the nearest local minimum. For this reason, starting the inversion from a good first-guess model is a crucial factor. To the best of our knowledge, an efficient method to determine a reliable starting model for FWI has not yet been found. Industry and academia have developed a number of procedures that might provide an adequate starting model, but they are usually very time-consuming. Furthermore, the majority of these procedures require picking the arrival times of a number of seismic events, and picking is generally a time-consuming task that is tedious, subject to interpretation, and prone to errors.

The approach that I propose does not require a tedious picking procedure and, at the same time, is resistant to falling into local minima. It is well known that a starting model for FWI is a model that lies in the valley of the global minimum of a certain misfit surface. Consequently, the basic idea behind our approach is to attack the local nature of FWI by developing a global optimization method. A global method is able to escape from a local minimum because it is not driven by local derivatives of a misfit functional. A strong limitation of stochastic methods in geophysical inversion problems, and especially in FWI, is the so-called curse of dimensionality. To mitigate this issue, I have developed a simple strategy to reduce the number of unknowns of the model space in the synthetic inversions: each model of such a model space contains only the low wave-numbers and, hence, can be referred to as a macro-model. Another issue faced in this work is the high computational cost of the stochastic full-waveform inversion. To reduce the overall computational cost, I limited the seismic propagation from 3D to 2D, and I performed several tests on an area of the subsurface smaller in size than those typically used in FWI tests. The time step and space step may also be changed to further reduce the computational cost of each forward-modeling run.
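The coarse-to-fine mapping behind this macro-model parameterization can be illustrated in one dimension: the stochastic search operates on a handful of coarse velocity nodes, which are interpolated onto the dense finite-difference grid before each forward modeling. The sketch below is a hypothetical 1D linear interpolation; the function name and node values are mine, and the actual implementation is two-dimensional (interpolation between the two grids is described in Chapter 6):

```python
def coarse_to_fine(coarse, fine_n):
    """Linearly interpolate values defined on a coarse inversion grid
    onto a fine finite-difference grid with fine_n samples."""
    m = len(coarse)
    fine = []
    for i in range(fine_n):
        t = i * (m - 1) / (fine_n - 1)   # position in coarse-grid units
        j = min(int(t), m - 2)           # index of the left coarse node
        frac = t - j                     # fractional distance to the right node
        fine.append((1 - frac) * coarse[j] + frac * coarse[j + 1])
    return fine

# A 3-node macro-model (velocities in m/s) expanded onto a 9-sample fine grid:
velocity_fine = coarse_to_fine([1500.0, 2000.0, 3000.0], 9)
```

The stochastic search thus handles only the few coarse unknowns, while the wave propagation still runs on the fine grid.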


The most popular stochastic methods in Exploration Geophysics are genetic algorithms (GAs) and the simulated annealing (SA) method. Another popular method, which originated in global seismology, is the neighborhood algorithm (NA). In this work, I compared a specific implementation of GAs, the Adaptive Simulated Annealing, and NA using two analytic objective functions and a 1D elastic FWI problem. The GA proved to be the best-performing method of the three for high-dimensional model spaces (>40). Consequently, I selected the GA and employed this method for the 2D acoustic full-waveform inversions on synthetic seismic data.
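To give a concrete, minimal picture of the kind of direct search a GA performs, here is a toy real-coded genetic algorithm applied to Rastrigin's multimodal function (one of the analytic benchmarks used later in Chapter 5). The operator choices and parameter values below (tournament selection, blend recombination, a 10% Gaussian mutation rate) are illustrative placeholders, not the configuration used in this thesis:

```python
import math
import random

def rastrigin(x):
    """Rastrigin's multimodal test function; global minimum 0 at the origin."""
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)

def run_ga(fitness, dims, pop_size=60, generations=200,
           lo=-5.12, hi=5.12, mutation_rate=0.1, sigma=0.3, seed=0):
    """Minimal real-coded GA: tournament selection, intermediate (blend)
    recombination, Gaussian mutation, and elitist bookkeeping of the best model."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for _ in range(dims)] for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        def tournament():
            a, b = rng.sample(pop, 2)
            return a if fitness(a) < fitness(b) else b
        children = []
        for _ in range(pop_size):
            p1, p2 = tournament(), tournament()
            w = rng.random()  # blend weight for intermediate recombination
            child = [w * u + (1 - w) * v for u, v in zip(p1, p2)]
            # Gaussian mutation, clipped to the search range
            child = [min(hi, max(lo, g + rng.gauss(0, sigma)))
                     if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
        candidate = min(pop, key=fitness)
        if fitness(candidate) < fitness(best):
            best = candidate
    return best, fitness(best)

best_model, best_misfit = run_ga(rastrigin, dims=2)
```

Despite the many local minima of Rastrigin's function, the population-based search drives the misfit close to the global minimum, which is exactly the behavior exploited to escape local minima in the FWI setting.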

The synthetic tests have been performed on both a portion and the entirety of the Marmousi model. The Marmousi model is a synthetic model characterized by intense layering, several faults, folds, and velocity inversions. Because of its complexity, it has been widely used as a benchmark to test FWI algorithms. The outcomes of the synthetic tests have been employed as starting models for local FWI. I proved the validity of my methodology by comparing the final outcome after local FWI with that of a reference workflow started from a smoothed version of the Marmousi model.

A two-step depeg-leg method for marine acquisitions with towed dual-sensor streamers

I present a two-step method that predicts (prediction step) and attenuates (subtraction step) peg-leg reflections in pre-stack seismic data acquired with towed dual-sensor streamers. The towed dual-sensor streamer makes it possible to separate the wavefield into its down-going and up-going components, and the method exploits the advantage of having both wavefields separately available.

In this method, a key role is played by the shaping deconvolution in local windows of the data, which is applied to the predicted peg-leg wavefield prior to the subtraction step. I show that for windows in which no peg-leg signal is present, the shaping deconvolution may alter the primary reflections and must not be applied. Hence, I added an automatic control to the local shaping deconvolution that preserves the primary signals during the peg-leg removal.
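The role of the automatic control can be sketched with a toy version of the windowed subtraction: a window is corrected only if the predicted peg-leg correlates with the data above a threshold; otherwise, the window (and therefore any primary it contains) is left untouched. The scalar "shaping" below is a drastic simplification of a full shaping deconvolution, and the function name and 0.5 threshold are illustrative assumptions, not the thesis's actual parameters:

```python
import math

def gated_peg_leg_subtract(data_win, pred_win, min_corr=0.5):
    """Subtract the predicted peg-leg from a local data window only when the
    prediction and the data are sufficiently correlated; otherwise return the
    window unchanged so that primaries are preserved."""
    num = sum(d * p for d, p in zip(data_win, pred_win))
    den = math.sqrt(sum(d * d for d in data_win) * sum(p * p for p in pred_win))
    if den == 0.0 or num / den < min_corr:
        return list(data_win)                   # gate closed: keep primaries intact
    scale = num / sum(p * p for p in pred_win)  # zero-lag least-squares scale factor
    return [d - scale * p for d, p in zip(data_win, pred_win)]
```

A window dominated by the predicted peg-leg is attenuated toward zero, while a window uncorrelated with the prediction passes through unchanged.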

Such a procedure has been applied to a synthetic data set generated by the reflectivity method. Using this data set, I verified the validity of the method and showed that the control added to the procedure dramatically improves the final result.


I want to thank Professor Alfredo Mazzotti for his constant support and encouragement; Mattia Aleardi for pushing the comparison of the performance of the different global methods, for sharing his MATLAB codes, for useful discussions, and for his cooperative aptitude and the time spent together; Eusebio Stucchi for his perseverance and attention to small details; Andrea Tognarelli for his patience when performing the velocity-analysis step and in dealing with real-data issues; Fabrizio, Paolo and Mirko of the Cineca technical-support staff, because without their support I would not have been able to implement part of the code used for this thesis; Paolo Marchetti of ENI for the idea of starting the FWI project; Nicola Bienati for his useful suggestions, hints, and circulation of stimulating papers; Mike Warner and the geophysics group at Imperial College for permission to use their finite-difference code; ENI for the financial support; Sambridge for permission to use his NA code; John Stockwell for his support in using the SU software; Yong Ma for his prompt and helpful support regarding his method; and Aaron Stanton for his code sumatchamp. Last but not least, I want to thank the three students I worked with, Matteo Caporal, Giuseppe Provenzano and Davide Scalise, for their ideas and stimulating questions; Claudia and Luigi for their emotional support; Daniele for his inexplicable mind; and Cesar and my family.


Contents

Abstract i

Acknowledgements iii

Contents iv

1 Introduction 1

2 The forward modeling 10

2.1 Introduction to wave propagation . . . 10

2.2 Introduction to numerical methods . . . 11

2.3 The finite-difference method . . . 11

2.3.1 The staggered-grid . . . 13

2.3.2 Numerical stability . . . 14

2.3.3 Numerical dispersion . . . 15

2.3.4 Boundaries of the model . . . 17

2.3.4.1 The free surface . . . 17

2.3.4.2 Absorbing Boundaries . . . 17

2.4 The finite-difference software . . . 18

3 Gradient-based full-waveform inversion 19

3.1 Introduction . . . 19

3.2 The steepest-descent method . . . 22

3.3 The conjugate-gradient method . . . 23

3.4 Time domain vs. frequency domain . . . 23

3.5 Model wave-numbers and data frequencies . . . 24

3.6 Ill-conditioning . . . 26

3.7 Computational issues and parallel computing . . . 29

4 Stochastic methods 30

4.1 Introduction . . . 30

4.2 Genetic algorithms . . . 31

4.2.1 Introduction . . . 31

4.2.2 The genetic-algorithm code . . . 33

4.2.2.1 Fitness assignment and selection . . . 33


Linear Ranking . . . 34

Non-linear Ranking . . . 35

4.2.2.2 Selection methods . . . 35

Roulette wheel selection . . . 35

Stochastic universal sampling . . . 36

4.2.2.3 Recombination . . . 37

Intermediate recombination . . . 37

Line recombination . . . 38

4.2.2.4 Mutation . . . 38

4.2.2.5 Reinsertion . . . 39

4.2.2.6 Niched genetic algorithms. . . 40

Migration . . . 41

4.3 The neighborhood algorithm . . . 42

4.3.1 Introduction . . . 42

4.3.2 Voronoi-approximated misfit surface . . . 42

4.3.3 The algorithm workflow . . . 43

4.3.4 The tuning parameters. . . 45

4.3.5 Computation of the Voronoi cells . . . 46

4.3.6 Curse of dimensionality . . . 46

4.3.7 Applications of NA to seismic . . . 47

5 Comparison of stochastic optimization methods 50

5.1 Introduction . . . 51

5.2 Adaptive simulated annealing . . . 52

5.3 Stochastic methods in geophysics . . . 52

5.4 Testing the algorithms on De Jong’s function number 1 . . . 53

5.5 Testing the algorithms on Rastrigin’s function . . . 54

5.6 Test on a 1D elastic full-waveform inversion . . . 55

5.7 Conclusions . . . 56

6 Acoustic full-waveform inversion tests using a genetic algorithm and two model grids 58

6.1 Approximations and limits . . . 59

6.2 The Marmousi model. . . 60

6.2.1 2D vs. 3D . . . 62

6.3 The two parameterization grids . . . 62

6.3.1 The inversion grid . . . 63

6.3.1.1 Introduction . . . 63

6.3.1.2 Wavenumber resolution . . . 64

6.3.1.3 Number of unknowns . . . 66

6.3.2 The finite-difference grid. . . 68

6.3.3 Interpolation . . . 70

6.3.3.1 Formalization of the inverse problem using interpolation . . . 72

6.4 Prior knowledge of the model parameters . . . 73

6.4.1 Model ranges . . . 74

6.4.1.1 Ranges centered on the optimal model. . . 75


6.5 Acquisition geometry . . . 77

6.6 The inversion parameters . . . 79

Finite-Difference grid (or fine grid) . . . 79

Seismic Wavelet. . . 79

A priori information on the model . . . 79

Misfit Criterion . . . 79

6.6.1 The parameters of the genetic algorithm . . . 80

6.6.1.1 Niching . . . 81

Number of subpopulations . . . 81

6.6.1.2 Ranking . . . 82

6.6.1.3 Selection . . . 84

6.6.1.4 Recombination . . . 85

6.6.1.5 Mutation . . . 85

6.6.1.6 Reinsertion . . . 86

6.6.1.7 Migration . . . 86

6.6.2 Parallel implementations . . . 87

6.6.2.1 MPI nesting . . . 88

6.6.2.2 Resource allocation via ssh . . . 89

6.6.2.3 Openmpi ext . . . 89

6.6.2.4 Forward models run in background . . . 90

Parallelization parameters for the forward modeling . . . . 90

Optimal parallelization parameters . . . 91

6.7 Data and model misfit evolutions . . . 91

6.8 The best-fitting model . . . 95

6.9 The best-fitting seismic data set . . . 104

6.10 Gradient-based full-waveform inversion . . . 106

7 Conclusions 111

A A two-step depeg-leg method for marine acquisitions with dual-sensor streamers 116

A.1 Introduction . . . 116

A.2 Introduction to peg-legs . . . 117

A.3 Introduction to the dual-sensor streamer . . . 118

A.4 Standard depeg-leg methods . . . 119

A.5 The two-step depeg-leg method . . . 120

A.5.1 The prediction step . . . 120

A.5.1.1 Wavefield extrapolation . . . 120

A.5.1.2 An example of wavefield extrapolation using dual-sensor streamer data . . . 121

A.5.2 The subtraction step . . . 122

A.5.2.1 Gross alignment . . . 123

A.5.2.2 Global shaping deconvolution. . . 123

A.5.2.3 Local shaping deconvolution . . . 124

A.5.3 Test on a synthetic data set . . . 125


A.5.3.2 Identification of the reflector positions . . . 127

A.5.3.3 Peg-leg wavefield extrapolation . . . 128

A.5.3.4 Shaping Deconvolution . . . 128

A.5.3.5 Correlation coefficients . . . 129

A.5.3.6 Constraint on the local subtraction . . . 131

A.6 Conclusions . . . 132


To C., my friends and my family.


Introduction

“Do we need to mitigate the nonlinearity of the inverse problem through more sophisticated search strategies such as semi-global approaches or even global sampling of the model space? Because this search is quite computationally intensive, should we reduce the dimensionality of the parameter models? To perform such global searches, should we speed up the forward model at the expense of its precision? If so, can we consider these procedures as intermediate strategies for approaching the global minimum?”

Virieux and Operto

“When in doubt, smooth.”

Sir Harold Jeffreys (quoted by Moritz, 1980)

“Large-dimensional spaces tend to be terribly empty.”

Tarantola

Gradient-based full-waveform inversion, or simply full-waveform inversion (FWI), is an iterative local optimization method that exploits all the information in the seismogram to determine a high-resolution image of the subsurface.

FWI is a step ahead of classical geophysical inversion methods because it exploits information from the entire seismogram. By contrast, in the past, geophysicists typically reduced long, wiggly seismograms to a few bytes of information (e.g., travel times, phase


velocities) and tried to explain this information with approximate theories. Gradient-based FWI is a comparatively young method with respect to, for example, travel-time tomography based on the ray-theory approximation.

FWI was first applied in the 1980s (Bamberger et al., 1982; Ikelle et al., 1988; Tarantola, 1988), and interest in this method has grown progressively in the geophysical community following several successful applications to different geophysical models with varied subsurface modeling approximations (e.g., acoustic, elastic, accounting for anisotropy, etc.) (Brossier et al., 2009; Gholami et al., 2013; Ratcliffe et al., 2011; Warner et al., 2012). Because of the great number of successful applications, especially to synthetic data sets (e.g., Warner et al., 2012), as well as to some field data (e.g., Ratcliffe et al., 2011), this method has garnered high expectations for possible future developments. Nevertheless, the method also features a few drawbacks, one of which is that it is limited by its local nature, i.e., it terminates in the nearest minimum of the objective function, which may differ from the global minimum. For this reason, starting the inversion from a good first-guess model is a crucial factor in the success of the method. Determining an adequate first-guess model for gradient-based full-waveform inversion is, at present, an open problem for the exploration geophysics research community. Henceforth, I will call this first-guess model the starting model for gradient-based FWI, or simply the starting model.

Which characteristics must a starting model for gradient-based FWI have? A good starting model is a model that lies in the valley of the global minimum of a certain misfit surface. This means that the starting model must be accurate enough to match most observations within less than a half-period deviation. In fact, if the predicted phase at a certain frequency is off by more than ±π, the inversion algorithm will attempt to match the wrong cycle of the seismogram (Pratt, 1999), and the correlated forward and adjoint wavefields will not focus at the location of the scatterer, as demonstrated, for instance, by Gauthier et al. (1986) and Pratt et al. (1996).
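The half-period condition can be stated numerically: a phase error of ±π at frequency f corresponds to a traveltime error of half a period, 1/(2f). The following check (function name and example values are illustrative, not from the thesis) makes the criterion concrete:

```python
def cycle_skips(traveltime_error_s, frequency_hz):
    """Half-period criterion: a traveltime error larger than T/2 = 1/(2f)
    corresponds to a phase error beyond +/- pi, so a local inversion will
    try to match the wrong cycle of the seismogram."""
    return abs(traveltime_error_s) > 1.0 / (2.0 * frequency_hz)

# At 10 Hz the half-period is 50 ms: a 60 ms error cycle-skips, a 40 ms one does not.
```

This also explains why inversions usually start from the lowest available frequencies: at 5 Hz the tolerance doubles to 100 ms, so a less accurate starting model can still avoid cycle skipping.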

A good starting model is also smooth. The aphorism – when in doubt, smooth –, quoted at the beginning of this chapter, is effectively a good direction to follow. A smooth model is preferred to a rough model or a model with sharp discontinuities because it is much more difficult for the gradient-based inversion workflow to suppress or shift a structure than to create a new one (Asnaashari et al., 2013). An incorrectly positioned sharp interface will produce a reflector that is not present in the observed data, and this may drive the gradient-based FWI to converge toward a local minimum.

At present, in the oil and gas exploration industry, a starting model for gradient-based FWI is commonly built by reflection tomography and migration-based velocity analysis (see Woodward et al. (2008) for a review of the tomographic workflow). Migration velocity

(12)

analysis (MVA) (Al-Yahya, 1989; Yilmaz and Chambers, 1984) is an iterative method in which each iteration consists of two distinct steps: (1) the data are imaged by pre-stack migration and (2) the velocity function is updated according to the migration results. This method relies on a high-quality initial model, which is commonly obtained via reflection tomography. The accuracy of the velocity function is determined by measuring the focusing of the reflections in the migrated image. The coherence of the image in common-image gathers (CIGs) after migration is the criterion most frequently used to judge the focusing.

In the literature, several methods have been proposed for determining an initial model for gradient-based FWI. The majority of them are variations on the travel-time tomography introduced by Bishop et al. (1985). Travel-time tomography can be thought of as an inverse problem that consists of minimizing the differences between the travel-times measured from the data and the travel-times computed using a forward-modeling operator. The forward modeling of the travel-times is usually accomplished by solving the eikonal equation, either by ray-tracing or by direct solutions using eikonal solvers.

Travel-time tomography may be referred to as reflection, refraction or transmission tomography, depending on the type of signals inverted for. It is worth noting that both reflection and refraction tomography are functions not only of the velocity model but also of the interface geometry. The interface geometry is itself a complex function of the velocity model and is usually obtained by interpreting the migrated image. This indirect dependence of the travel-time on the velocity model through the interface geometry increases the modeling operator's degree of nonlinearity with respect to the unknown parameters (the velocity function) and consequently increases the difficulty of these types of tomography.

Travel-time tomography requires the picking of the desired seismic events, i.e., reflected, refracted, or transmitted arrival times. Picking can be an exceptionally time-consuming task, even more so when the inversion is three-dimensional. In the 3D case, the seismic events are 2-D continuous surfaces buried in a cube of seismic data. These surfaces are commonly called horizons, and in the worst cases, they can be interrupted by faults, which displace the horizons. Faulting does occur, e.g., in the Marmousi synthetic model (see Chapter 6). In the case of faulting, the process of picking the arrival times of seismic horizons is especially prone to errors. Figure 1.1 shows an example of picking performed on a group of common-depth-point gathers (CDPs); the yellow points correspond to the picked arrival times of the first breaks. Among the different implementations of travel-time tomography, I will describe in more detail two approaches that tend toward a more automatic procedure than those used in the oil and gas industry (i.e., reflection tomography plus migration-based velocity analysis). These approaches


Figure 1.1: An example of a picking procedure for a small number of CDPs. The yellow points are the picked arrival times of the first breaks. Picking can be an exceptionally time-consuming and extremely tedious task.

are: first-arrival travel-time tomography (FATT) (Nolet, 1987) and stereotomography (Billette and Lambaré, 1998; Lambaré, 2008).

FATT performs nonlinear inversions of first-arrival travel-times to produce smooth models of the subsurface. Examples of applications of FWI to real data using a starting model built by FATT are shown, for example, in Ravaut et al. (2004). Depending on the velocity model and on the desired target, FATT can be a very efficient method. However, blind tests performed using joint FATT and FWI suggest that very low frequencies and very large offsets are required to obtain reliable results in a number of problems (Brenders and Pratt, 2007a,b,c). Furthermore, a drawback of FATT is that it is not suitable when low-velocity zones exist, because low-velocity zones create shadow zones (Virieux and Operto, 2009).

Following these considerations, I expect that travel-time tomography methods that address both refraction and reflection travel-times will provide more consistent starting models for FWI. In this context, a promising technique is stereotomography. This method is an optimization of the controlled-directional-reception (CDR) method (Riabinkin, 1957; Sword, 1986, 1987), and it simultaneously inverts travel-times and the local slopes of events. Stereotomography has been implemented in several domains, and it relies on a semiautomatic picking procedure. Nevertheless, one of the authors noted that – much remains to be done for the practical use of stereotomography: for example, for fully recovering really complex velocity models such as Marmousi, and to improve picking which still remains a serious bottleneck – (Lambaré, 2008).


Figure 1.2: From the synthetic Valhall velocity model centered on the gas layer: on the left, velocity profiles at a distance of 7.5 km extracted from the true model (black line), from the starting model built by FATT (Prieux et al., 2009) (blue line), and from the corresponding FWI model (red line). On the right, same as the left panel apart from the starting model, which is built by stereotomography (Lambaré and Alérini, 2005), and the corresponding FWI model. The frequencies used in the inversion are between 4 and 15 Hz. (Images taken from Virieux and Operto, 2009.)

An example of gradient-based FWI using FATT and reflection stereotomography applied to the Valhall synthetic model is shown in Figure 1.2, on the left and right panels, respectively. Stereotomography successfully reconstructs the large wavelengths within the gas cloud down to a maximum depth of 2.5 km, whereas FATT fails to reconstruct the large wavelengths of the low-velocity zone associated with the gas cloud. However, the FWI model inferred from the FATT starting model shows an accurate reconstruction of the shallow part of the model. Both methods, therefore, present advantages and drawbacks.

Finally, a different approach for building a starting model for FWI is provided by Laplace-domain and Laplace-Fourier-domain inversions (Shin and Cha, 2008; Shin and Ha, 2008). Laplace-domain inversion can be viewed as a frequency-domain gradient-based waveform inversion that uses complex-valued frequencies, of which the real part is zero and the imaginary part controls the damping of the seismic wavefield. The method was successfully applied to the BP benchmark model, recovering the deeper structure of the starting model better than refraction travel-time tomography (Bae et al., 2011), and to a field data set from the Gulf of Mexico (Shin and Cha, 2009). Figure 1.3 shows the application of this method to the synthetic EAGE/SEG salt model. After 2,000 iterations, the method returns a high-quality image of the subsurface that is close to the true model. The method is able to recover frequencies lower than those effectively propagated by the seismic sources, an ability the authors call 'mirage resurrection'. However, this method is also subject to a number of drawbacks. First, it requires accurate picking.


Figure 1.3: The results of the waveform inversion in the Laplace domain for the SEG/EAGE 3-D salt model. (a) A 2-D velocity profile of the SEG/EAGE 3-D salt model, (b) an initial velocity model, (c) the inverted velocity model at the 2,000th iteration, and (d) the migrated image generated from the inverted velocity model (panel c). (Images taken from Shin and Cha, 2009.)

In fact, Laplace-domain methods are sensitive to noise prior to the first arrivals because of the damping nature of the Laplace transform (Shin and Cha, 2008). Consequently, it is necessary to carefully mute the noise appearing before the first arrivals to obtain clean seismic data (Ha et al., 2012). This is the most critical pre-processing step because Laplace-domain techniques give more weight to the early-arrival signals and damp out late arrivals. Next, the method requires two additional parameters: a stabilizing factor and a set of damping coefficients for the inversion. The stabilizing factor must be added to the diagonal elements of the pseudo-Hessian matrix to avoid divisions by small values (Ha et al., 2012). The set of damping coefficients for the inversion must be selected accurately, but, to date, a consistent procedure for selecting the damping coefficients has not been established (Shin et al., 2013).

So far, I have presented a brief review of the methods presently used to determine a starting model for gradient-based FWI. Note that all of these methods require picking. Manual picking is an extremely tedious and time-consuming task, and it can be affected by interpretation errors. Even in the case of semiautomatic picking, human verification of the picking is required. Therefore, the possibility of avoiding the picking procedure altogether is desirable. My aim is to rely on a method that does not require a long, tedious picking procedure and that is also resistant to falling into local minima. The basic idea behind my method is to attack the local nature of full-waveform inversion by developing a global optimization method. Because full-waveform inversion is prone to falling into local minima, this limitation can be overcome by employing a global optimization method that leads the full-waveform inversion toward the global minimum. In fact, a global optimization method is able to escape from a local minimum because it is not guided by the local derivatives of the misfit function like gradient-based


FWI. It instead performs a random-like direct search on the model space. Accordingly, local and global methods are also called derivative (or gradient-based) methods and derivative-free (or direct-search) methods, respectively. Global methods can be divided into two categories: grid-search methods and stochastic methods. Grid-search methods perform an extensive search of the model space by solving the forward problem and evaluating the misfit at each node of a regular grid that spans the entire search domain. Even in problems with a relatively small number of unknowns (< 20-30), this approach is extremely computationally expensive and is therefore avoided in favor of stochastic methods.
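The cost explosion of grid search is easy to quantify: the number of forward models is the number of nodes per axis raised to the number of unknowns. A quick sketch (function names and the per-model cost figure are hypothetical):

```python
def grid_search_models(nodes_per_axis, dims):
    """Forward models required by an exhaustive grid search over `dims` unknowns."""
    return nodes_per_axis ** dims

def serial_runtime_years(n_models, seconds_per_model):
    """Wall-clock years needed to run n_models forward simulations serially."""
    return n_models * seconds_per_model / (3600 * 24 * 365)

# Even a coarse 10-node-per-axis grid over 20 unknowns is hopeless:
n_models = grid_search_models(10, 20)        # 10**20 forward models
years = serial_runtime_years(n_models, 1.0)  # at an optimistic 1 s per model
```

Stochastic methods, by contrast, spend their budget of tens or hundreds of thousands of forward models only on the promising regions of the model space.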

Stochastic methods rely on probabilistic inverse theory. In this framework, the solution of the inverse problem is a probability distribution over the model space. From such a probability distribution, one can pick the set of models with probabilities higher than a certain threshold and treat them as the solution of the inverse problem. In the simplest case of a multidimensional Gaussian distribution with a narrow shape, the model located at the center of the Gaussian bell is the one that best solves the inverse problem. This single model is the solution of the inverse problem, with an uncertainty proportional to the width of the Gaussian bell. The first applications of stochastic methods to the solution of geophysical inverse problems can be found in Keilis-Borok and Yanovskaja (1967) and Press (1968).

When a probability distribution has been defined over a space of a few dimensions (1-4), one can directly represent the associated probability density. However, as the dimensions of the model space increase, such representations become increasingly difficult. This is the case for the majority of problems in Exploration Seismology, in which, typically, the number of unknowns is very large. Imagine the model as a point in a multi-dimensional vector space (the model space). Each unknown of the model is a direction of the multi-dimensional model space. As the dimensions of the model space increase, the number of points required to fill the space increases dramatically, and the data tend to be sparse. Locating a specific region of the space becomes progressively harder. A way to observe this is to estimate the probability of randomly hitting the hypersphere inscribed in a hypercube as the dimension of the space grows. Figure 1.4 shows that the ratio of the volume of the hypersphere to the volume of the hypercube rapidly decreases to zero. When the goal is not to hit a large sphere as in Figure 1.4, but rather a small region of significant probability located in an unknown portion of the model space, it is evident that implementing an efficient method that performs this task is far from trivial. The sparsity of data is problematic for any method that requires statistical significance. In fact, to obtain a statistically reliable result, the amount of data required to support the results grows exponentially with the dimensionality. Additionally, organizing and


Figure 1.4: The ratio of the volume of a hypersphere of dimension n to the volume of the hypercube of the same dimension. Hitting the circle inscribed in a square by chance is easy. Hitting the sphere inscribed in a cube by chance is slightly more difficult. When the dimensions of the space grow, the probability of hitting the hypersphere inscribed in a hypercube rapidly becomes zero. This figure explains why the random exploration of large-dimensional spaces is always difficult. (Image taken from Tarantola, 1987.)

searching data often relies on detecting areas where objects form groups with similar properties. However, in high-dimensional data, all objects appear to be sparse and dissimilar in many ways, which drastically diminishes the efficiency of common data-organization strategies. The group of phenomena that arises in large-dimensional model spaces is called the curse of dimensionality (Bellman, 1957). Several strategies have been developed to decrease the dimensions of the model space and increase the statistical significance of the inversion. I will present a strategy to attack the curse of dimensionality in Chapter 6.
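The hypersphere-to-hypercube volume ratio illustrated in Figure 1.4 has the closed form pi**(n/2) / (Gamma(n/2 + 1) * 2**n) for a unit-radius sphere in its circumscribing cube of side 2. The short script below (the function name is mine) reproduces its rapid decay:

```python
import math

def sphere_to_cube_volume_ratio(n):
    """Volume of the unit-radius n-dimensional hypersphere divided by the
    volume of its circumscribing hypercube of side 2:
    pi**(n/2) / (Gamma(n/2 + 1) * 2**n)."""
    return math.pi ** (n / 2) / (math.gamma(n / 2 + 1) * 2 ** n)

# The ratio equals 1 in one dimension and collapses toward zero as n grows,
# e.g. pi/4 in 2D, pi/6 in 3D, and below 1e-7 by n = 20.
ratios = [sphere_to_cube_volume_ratio(n) for n in range(1, 21)]
```

By n = 20, a uniformly drawn point in the hypercube almost never lands inside the hypersphere, which is exactly the behavior that makes blind random exploration of high-dimensional model spaces hopeless.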

A problem related to the curse of dimensionality and, more generally, to the computational feasibility of a stochastic inversion concerns the computational cost of a single forward model. Typically, tens or hundreds of thousands of forward models are computed during a single stochastic inversion. Reducing the computational cost of the forward modeling can therefore greatly reduce the overall cost of the inversion. Aware of all these statistical and computational issues, I have developed a strategy to attack the problem of finding a starting model for FWI using a global direct-search stochastic algorithm.

Several stochastic methods are available and are frequently used in geophysical problems. The first problem that I encountered in this work concerned the choice of the most suitable stochastic method for our task. The most popular stochastic methods in exploration geophysics are genetic algorithms (Goldberg, 1989; Holland, 1975; Mitchell, 1998) and the simulated-annealing method (Geman and Geman, 1984; Kirkpatrick et al., 1983). Another popular method is the neighborhood algorithm (Sambridge, 1999a,b). Other significant algorithms include methods that use Hopfield neural networks (Hopfield, 1982),


ant colony optimization (Dorigo and Gambardella, 1997), particle swarm optimization (Kennedy and Eberhart, 1995), and the Taboo-search method (Glover, 1989, 1990). To determine which method is most apt for our task, I compared three stochastic methods: genetic algorithms (GAs), an improved version of simulated annealing developed by Ingber (1989, 1993), i.e., the adaptive simulated annealing (ASA), and the neighborhood sampling algorithm (NA) (Sambridge, 1999a). According to the results of this comparison, I have chosen the genetic algorithm to perform the stochastic inversion. Subsequently, I tested the method on synthetic data sets generated on the Marmousi model.

This thesis is organized as follows. Chapter 1 is the present Introduction. In Chapter 2, I introduce the wave equation and I present a numerical method to solve the forward modeling, namely, the finite-difference method. This method is the foundation of the forward-modeling routine that I use to simulate the wave propagation in the subsurface throughout this thesis. In Chapter 3, I introduce gradient-based FWI. Without any pretense of completeness, I address the main characteristics of gradient-based FWI and a number of issues related to its current implementations. Chapter 4 briefly presents stochastic optimization methods. The chapter then focuses on genetic algorithms and the neighborhood algorithm. In Chapter 5, I briefly introduce the adaptive simulated annealing, and I show the results of a comparison test between GAs, NA and ASA that I performed using two analytic functions with varied dimensions of the model space, and a 1-D elastic inverse problem. Finally, an operative stochastic full-waveform inversion procedure is presented in Chapter 6. This procedure is applied to the classic Marmousi model, which is frequently used as a benchmark to test new full-waveform inversion algorithms. The results of the synthetic tests are discussed in the conclusions. The minor project, which concerns the implementation and application of a two-step depeg-leg procedure for dual-sensor streamer acquisitions, is presented and discussed in the appendix.


Chapter 2

The forward modeling

“Someone told me that each equation I included in the book [A Brief History of Time] would halve its sales.”

Stephen Hawking (1942)

2.1 Introduction to wave propagation

The modeling of the seismic wave propagation in the Earth plays a key role in exploration seismology, because it allows us to understand the character of recorded seismic data. The motion of seismic waves is governed by the wave equation that relates the displacement of the ground-particles to external forces and to the distributions of density and elastic parameters. In active-source full-waveform inversion, the acoustic wave equation is frequently used in lieu of the elastic equation, i.e., even when the medium is not a fluid. That is because the numerical solution of the acoustic wave equation is computationally inexpensive compared to the solution of the elastic wave equation.

The acoustic wave motion can be fully described by a single displacement scalar field u, which contains information about the pressure wavefield, p, and which satisfies the equation:

∂²u/∂t² − v_p² ∇²u = v_p² ∇ · s,    (2.1)

where s is the source term and v_p is the acoustic wave speed, which, in general, is a function of the spatial coordinates. In the equation, I denote with ∇² the Laplacian, i.e., ∇² = ∂²/∂x² + ∂²/∂y² + ∂²/∂z², and with ∇ · s the divergence of the source term, i.e., ∇ · s = ∂s_x/∂x + ∂s_y/∂y + ∂s_z/∂z. Note that the acoustic approximation implies the restriction to isotropic source radiation patterns and the absence of Rayleigh waves and P-to-S conversions. The acoustic approximation is, nevertheless, justifiable when the data analysis is restricted to the first-arriving P-waves and when the seismic sources radiate little S-wave energy (e.g., explosions). In this work, I will follow the acoustic approximation and restrict my studies to the acoustic wave equation.

2.2 Introduction to numerical methods

Exact solutions (or analytical solutions) of the wave equation exist only for a limited number of canonical models. In the case of more realistic heterogeneous media, the preferred choice is to approximate the wave equation by discretizing its derivatives; the resulting solutions are called numerical solutions of the wave equation.

A number of methods have been presented for the numerical solution of the wave equation and successfully applied to different types of subsurface models, among them: finite-difference methods (early applications can be found in Alterman and Karal (1968); Boore (1970, 1972)), finite-element methods (Courant, 1943; Turner et al., 1956), spectral-element methods (Patera, 1984), pseudo-spectral methods (Furumura et al., 1998; Kosloff and Baysal, 1982), optimal operators (Geller and Takeuchi, 1995), the direct solution method (Cummins et al., 1994a,b; Geller and Ohminato, 1994), and discontinuous Galerkin methods (see Arnold et al., 2002, for an overview of the historical development of this method).

2.3 The finite-difference method

In the context of gradient-based full-waveform inversion, the two most frequently used methods are the finite-difference and spectral-element methods. Examples of applications of finite-difference modeling are shown in Alford et al. (1974) and Kelly et al. (1976), and examples of applications of spectral-element methods are shown in Maday and Patera (1989). Of these two methods, the finite-difference method is the simplest and most intuitive. I will describe it in more detail in the following paragraphs.

The finite-difference method consists of approximating the derivatives by difference quotients. Given a function f of the spatial coordinate x, defined at certain grid positions (x_i = x_0, x_0 + ∆x, ..., x_0 + n∆x), one can approximate its derivative in this way:

∂f(x_i)/∂x ≈ [f(x_i + ∆x) − f(x_i − ∆x)] / (2∆x).    (2.2)

In practice, one approximates the limit of the difference quotient as the increment approaches zero (the derivative) with a difference quotient having a finite increment. Equation 2.2 is a second-order approximation of the derivative, and, depending on the desired accuracy, higher-order approximations can be implemented. A finite-difference approximation of the derivative of order 2N (also called a finite-difference stencil of order 2N) satisfies:

∂f(x_i)/∂x = Σ_{n=1}^{N} g_n [f(x_i + n∆x) − f(x_i − n∆x)] + O(∆x^{2N}),    (2.3)

where g_n, n = 1, ..., N, are scalar coefficients. Different methods can be employed to determine the values of the g_n coefficients, e.g., using truncated Taylor series or Fourier spectra (see, e.g., Fichtner, 2011). Higher-order approximations converge faster to the exact derivative as ∆x tends to zero, and, also for a finite increment ∆x, they generally yield more accurate solutions. In the finite-difference approximation of the acoustic wave equation, one must approximate both space derivatives and time derivatives. For the sake of simplicity, I restrict myself to the simplest case of a propagation along a single direction in space (1D acoustic wave equation), and I disregard external sources. Hence, Equation 2.1 can be rewritten as:

ρ ∂²u(x, t)/∂t² − k ∂²u(x, t)/∂x² = 0,

where ρ is the mass density, k is the bulk modulus, and it holds that v_p = √(k/ρ). As initial conditions, I impose u(x, t)|_{t=0} = u_0(x) and ∂u(x, t)/∂t|_{t=0} = 0, and, as boundary conditions, I require zero displacement at the borders x = ±L.

According to Marfurt (1984), the pressure wavefield can be defined only at certain points and the equation can be written in the following canonical form:

M d²u/dt² + K · u(t) = 0,    (2.4)

where the matrices M and K are the mass matrix and the stiffness matrix, respectively. Note that, in Equation 2.4, the space derivatives are replaced by a matrix multiplication.


Different finite-differencing schemes can be chosen to approximate the time derivatives; here I show one of them, i.e., I approximate ∂²u/∂t² by the second-order finite difference

∂²u/∂t² ≈ (1/∆t²) [u(t + ∆t) − 2u(t) + u(t − ∆t)].

This yields an explicit time-stepping scheme that permits us to advance the wavefield in discrete time steps ∆t:

u(t + ∆t) = 2u(t) − u(t − ∆t) − ∆t² M⁻¹ K u(t),

i.e., knowing the displacement at times t and t − ∆t, it is possible to compute the displacement at time t + ∆t. This suggests the following recipe:

1. Calculate u(∆t), using the initial conditions u(−∆t) and u(0).

2. Use u(∆t) and u(0) to determine u(2∆t).

3. For each n ≥ 2, use u((n − 1)∆t) and u(n∆t) - which have been calculated in steps (n − 2) and (n − 1), respectively - to determine u((n + 1)∆t).

This procedure is repeated as long as required.
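As a toy illustration (my own sketch, not one of the thesis codes), the recipe above can be written in a few lines of Python for the 1D acoustic equation, with second-order stencils in space and time; the grid size, velocity and Gaussian initial condition are arbitrary illustrative choices:

```python
import numpy as np

# Explicit time stepping for the 1D acoustic wave equation with
# second-order stencils and zero displacement at the borders x = +/- L.
nx, nt = 201, 400
dx, dt, vp = 10.0, 1.0e-3, 2000.0          # m, s, m/s (illustrative)
x = np.linspace(-1000.0, 1000.0, nx)

u_prev = np.exp(-(x / 50.0) ** 2)          # u(-dt): initial condition u0
u_curr = u_prev.copy()                     # u(0): zero initial velocity
c2 = (vp * dt / dx) ** 2                   # squared Courant number (0.04)

for _ in range(nt):
    lap = np.zeros(nx)
    lap[1:-1] = u_curr[2:] - 2.0 * u_curr[1:-1] + u_curr[:-2]
    u_next = 2.0 * u_curr - u_prev + c2 * lap   # step 3 of the recipe
    u_next[0] = u_next[-1] = 0.0                # fixed boundaries
    u_prev, u_curr = u_curr, u_next
```

After 400 steps the initial pulse has split into two half-amplitude pulses travelling in opposite directions, as predicted by d'Alembert's solution.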

2.3.1 The staggered-grid

In the finite-difference schemes described so far, grid points are evenly spaced and symmetric with respect to a central grid point at which the derivative is approximated (see Equation 2.2). Even though this approach appears to be the most straightforward, more efficient variations have been implemented. An important variation is the so-called staggered-grid discretization (Madariaga, 1976; Virieux, 1984, 1986), in which the first derivative is computed between the grid points and not at the grid points. This means that ∂f(x)/∂x is approximated in terms of f given at the grid positions x − (n + 1/2)∆x for n = −N, ..., 0, ..., N − 1. Consequently, displacements, velocities, and accelerations are computed at different grid positions. For instance, the velocity grid is space-shifted with respect to the displacement grid by half the cell width in each direction. The staggered grid strongly outperforms the conventional grid in which each variable (displacement, velocity, acceleration) is computed at the same grid position. An intuitive explanation of the differences in accuracy between the two grids is provided in Figure 2.1. This figure displays the points at which the derivative of the sinc function is computed for the staggered grid (white circles) and the conventional grid (black squares), respectively. The finite-difference coefficients for the staggered grid decay as n⁻², meaning that only


Figure 2.1: The derivative of the sinc function and the sampling points in the regular grid (black squares) and in the staggered grid (white circles). The finite-difference coefficients for the staggered grid decay as n⁻², meaning that only coefficients with small n effectively contribute to the discrete convolution. In contrast, on the conventional grid the coefficients decay slowly, as n⁻¹, thus requiring a greater number of points to achieve the same accuracy as the staggered-grid approximation. (Image taken from Fichtner, 2011.)

coefficients with small n effectively contribute to the discrete convolution. In contrast, on the conventional grid the coefficients decay slowly, as n⁻¹, thus requiring a greater number of points to achieve the same accuracy as the staggered-grid approximation. Historically, the staggered-grid scheme yielded a breakthrough in finite-difference modeling.
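This accuracy gain is easy to verify numerically; the following sketch (my own check, using f(x) = sin x in place of the sinc of Figure 2.1) compares the two second-order stencils at the same spacing:

```python
import numpy as np

# Second-order first-derivative stencils for f(x) = sin(x), exact f'(x) = cos(x):
#   regular grid:   [f(x + dx)   - f(x - dx)  ] / (2 dx)
#   staggered grid: [f(x + dx/2) - f(x - dx/2)] / dx
x, dx = 1.0, 0.1
exact = np.cos(x)

err_regular = abs((np.sin(x + dx) - np.sin(x - dx)) / (2 * dx) - exact)
err_staggered = abs((np.sin(x + dx / 2) - np.sin(x - dx / 2)) / dx - exact)

# Leading error terms: f'''(x) dx^2 / 6 (regular) vs f'''(x) dx^2 / 24
# (staggered), so the staggered stencil is roughly four times more
# accurate at the same spacing dx.
ratio = err_regular / err_staggered
```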

2.3.2 Numerical stability

When explicit time-marching schemes are used for the numerical solution, the CFL condition, named after Courant et al. (1928), gives a condition for convergence. The CFL condition is used to restrict the time step in explicit time-marching computer simulations. For instance, if a wave is crossing a discrete grid distance, ∆x, then the time step must be less than the time needed for the wave to travel to an adjacent grid point; otherwise, the simulation will produce incorrect results. As a corollary, when the grid-point separation is reduced, the upper limit for the time step must also decrease. For stability, the discretization must satisfy

∆t < c ∆x / v_max,    (2.5)

where ∆x is the cell width in which the wave propagates, v_max is the propagation speed of the fastest wave, that is, the P wave, and c is a dimensionless number called the Courant number. The value of the Courant number depends on the methods used for the space and time discretization. Its order of magnitude is 1. For 4th-order spatial derivatives the Courant number is 0.606 (Sei and Symes, 1995).

If Equation 2.5 is not satisfied, within a time step ∆t the wavefront will cover a distance larger than c∆x, yielding an unstable result. The unstable solution will propagate


Figure 2.2: Example of numerical (in)stability with a limit of c∆x/v_max imposed by the CFL condition. Top: The time increment ∆t = 0.39 s is slightly below the threshold and the solution is stable. Bottom: The time increment ∆t = 0.41 s is slightly above the threshold, leading to instabilities. After a few iterations the numerical solution explodes. (Image taken from Fichtner, 2011.)

through the whole model and will soon fill the seismogram with excessively large numbers that will increase in amplitude, soon becoming out-of-range or NaN (Not a Number) values. Figure 2.2 compares a stable propagation and an unstable propagation. In the case of an unstable propagation, the solution 'explodes'. The finite-difference algorithm that I use in this thesis for the synthetic tests requires the time step to be sufficiently short such that a wave travels less than 50% of the grid spacing, in one time step, at the highest velocity in the model.
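Equation 2.5 translates directly into a one-line check. The helper below is my own sketch (the cell width and maximum velocity are illustrative; the default Courant number is the 4th-order value 0.606 quoted above):

```python
def max_stable_dt(dx, v_max, courant=0.606):
    """Largest stable time step (Eq. 2.5) for cell width dx [m] and the
    fastest propagation speed v_max [m/s] in the model."""
    return courant * dx / v_max

# e.g., a 10 m grid with a 4500 m/s P-wave velocity allows dt < ~1.35 ms
dt_limit = max_stable_dt(10.0, 4500.0)
```

Note that halving the grid spacing halves the stability limit on the time step, which is the corollary mentioned above.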

2.3.3 Numerical dispersion

The numerical solution of the wave equation tends to disperse even if, in the analytical solution, all frequencies travel at the same speed. This undesirable phenomenon is called numerical dispersion or grid dispersion. Finite-difference schemes are intrinsically dispersive, but some rules can be given to reduce the amount of the numerical error, which is the difference between the correct analytical solution and the numerical solution. Three main factors control the numerical error:

• the dominant (or the minimum) wavelength, λ,

• the order of the finite-difference operator (2nd order, 4th order, etc.), and

• the number of iterations or time steps.

These three factors constrain the length of the grid spacing. Depending on their value, the grid spacing is allowed to increase or must accordingly decrease. Broadly speaking,


Figure 2.3: Snapshots of the propagation of an acoustic wavefront over a homogeneous medium, using (a) a finite-difference scheme with no dispersion, and (b) a dispersive finite-difference scheme. Note that the wavefront, which is well defined in (a), becomes blurred in (b). In both (a) and (b), ∆x = 1 m, ∆t = 0.2 ms, f_max = 260 Hz. In (a) v_p = 500 m/s, and in (b) v_p = 300 m/s. (Images taken from Thorbecke, 2013.)

the grid spacing should sample the dominant (or the minimum) wavelength a certain number of times. Consequently, as the dominant wavelength λ decreases, the grid spacing must accordingly be reduced, and, as the dominant wavelength increases, the grid spacing is allowed to increase. The order of the finite-difference approximation determines the accuracy of the finite-difference scheme; thus, as it increases, fewer points per wavenumber are required to approximate the wavefield solution. Finally, the numerical error increases linearly with the number of time steps. According to these considerations, a widespread rule of thumb says that, for a scheme with fourth-order accuracy in space, at least five points of the finite-difference grid are necessary to sample the distance covered by the minimum wavelength that propagates through the model (the so-called “5 points per wavelength” rule of thumb) (Alford et al., 1974):

∆x < v_min / (5 f_max),    i.e.,    ∆x < λ_min / 5.

According to Sei and Symes (1995), this rule must be understood as valid for an average geophysical medium, and for the propagation of approximately 100 wavelengths through the medium. Figure 2.3 shows examples of finite-difference propagations in the case of no dispersion and with dispersion, on the left and on the right, respectively (Thorbecke, 2013).
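The rule of thumb can be encoded the same way (my own sketch; the water velocity and maximum frequency are illustrative values):

```python
def max_grid_spacing(v_min, f_max, points_per_wavelength=5):
    """Largest grid spacing that samples the minimum wavelength
    lambda_min = v_min / f_max with the requested number of points."""
    return v_min / (points_per_wavelength * f_max)

# e.g., v_min = 1500 m/s (water) and f_max = 30 Hz require dx < 10 m
dx_limit = max_grid_spacing(1500.0, 30.0)
```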


2.3.4 Boundaries of the model

2.3.4.1 The free surface

It is difficult to accurately implement the free surface because of the non-local nature of the finite-difference approximations. That is, the finite-difference method requires knowledge of the values at points located beyond the interface. The non-trivial nature of the free surface in finite-difference methods has led to the development of several numerical techniques, e.g., vacuum formulations (Zahradník and Urban, 1984), image methods (Levander, 1988) and interior methods (Kristek et al., 2002; Moczo et al., 1997).

2.3.4.2 Absorbing Boundaries

When simulating the wave propagation, it is necessary to restrict the computational domain to only a part of the true physical domain. This introduces artificial reflecting boundaries, which cause artifacts that may pollute the solution and dominate the numerical error if not treated adequately. The most common methods used to suppress these artifacts are absorbing boundary conditions and absorbing boundary layers.

Absorbing boundary conditions prescribe the wavefield and its derivatives along the artificial boundary. In the case of low-order conditions (e.g., Clayton and Engquist, 1977; Engquist and Majda, 1977), the absorption is perfect only for waves propagating normal to the boundary, and it decreases in efficacy as the angle deviates from the orthogonal direction. High-order absorbing boundary conditions (e.g., Higdon, 1991; Keys, 1985) partially overcome this deficiency.

Absorbing boundary layers are based on the extension of the computational domain along the artificial boundary by a thin region in which the wave equation is artificially attenuated. The simplest absorbing boundary layer technique is the Gaussian taper method, which iteratively multiplies the incident wavefield by numbers smaller than 1. Figure 2.4 compares a synthetic acquisition over a homogeneous model obtained using (a) no absorbing boundary layers, and (b) Gaussian taper boundary layers. Note the artificial reflections at the borders in (a), which disappear in (b). In 1994, Bérenger introduced the perfectly matched layer (PML) method, which is both more efficient and numerically more involved than the Gaussian taper. The PML method is based on variants of the wave equation that produce waves with exponentially decreasing amplitude.
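A minimal sketch of the Gaussian taper idea (my own illustration, not the actual fullwave3D implementation; the strip width `n_layer` and the decay constant `alpha` are arbitrary choices):

```python
import numpy as np

def gaussian_taper(nx, n_layer=30, alpha=0.05):
    """1D taper: equal to 1 in the interior, smoothly decaying to
    exp(-(alpha * n_layer)**2) at the two borders of the grid."""
    taper = np.ones(nx)
    d = np.arange(n_layer, 0, -1)                # distance to the interior
    taper[:n_layer] = np.exp(-(alpha * d) ** 2)  # left strip, edge first
    taper[-n_layer:] = taper[:n_layer][::-1]     # mirrored right strip
    return taper

# Inside the time loop one would multiply the wavefield by the taper at
# every step, e.g.:  u_curr *= taper; u_prev *= taper
taper = gaussian_taper(200)
```

The repeated multiplication attenuates any energy entering the boundary strip a little more at every time step, so little of it survives the round trip back into the interior.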


Figure 2.4: Receiver recordings in a homogeneous medium (a) without and (b) with taper. The receivers are placed at 300 m depth and the source is positioned in the middle of the model at (500, 500). The grid distance is 1 meter. (Images taken from Thorbecke, 2013.)

2.4 The finite-difference software

The forward-modeling engine I use is the finite-difference method, implemented using a routine provided to us by ENI and originally developed by the Geophysics Department of Imperial College London (Warner et al., 2010). This routine is provided in the fullwave3D package. In the following, I will refer to this routine as the finite-difference scheme of fullwave3D, or simply fullwave3D-FD (Warner, 2011).

In fullwave3D-FD, the finite-difference stencil is five nodes wide in space, and it includes nodes in oblique directions. The finite-difference weights are optimized such that the stencil is approximately fourth-order accurate in time and better than fourth-order accurate in space. It suffers minimal numerical dispersion and anisotropy at five grid points per wavelength. For practical implementations, four grid points are normally adequate if the lowest-velocity material is confined to thin layers, as, for example, in shallow water with a fast seabed. Conversely, where there is a significant thickness of low-velocity material, as, for example, in deeper water, five grid points per wavelength are recommended to avoid numerical dispersion at the highest frequencies.

The absorbing boundaries are implemented using a mixture of Gaussian taper and perfectly matched layers (Skarlatoudis et al., 2006; Warner, 2011).


Chapter 3

Gradient-based full-waveform inversion

“When a traveler reaches a fork in the road, the l1-norm tells him to take either one way or the other, but the l2-norm instructs him to head off into the bushes.”

John F. Claerbout and Francis Muir, 1973

3.1 Introduction

Gradient-based full-waveform inversion is a data-fitting procedure that (i) numerically solves the equations of motion, (ii) exploits the information of the full waveform and (iii) uses it to iteratively improve the tomographic image of the subsurface. First, note that the entire information of the wavefield is used in the inversion procedure. This differs from conventional tomography, based on the ray theory approximation, in which only the information carried by travel times is exploited. The challenge of full waveform inversion is to retrieve information also from the shape of the waveforms.

Regarding the numerical methods that can be employed to solve the equation of motion, I listed several of them in Chapter 2. Hence, I refer the reader to that chapter for more information on that topic. Here, I rather wish to stress that before choosing the numerical method, it is important to determine which physical approximation


to make of the wave-physics complexity. The majority of the case studies of FWI approximate the wave propagation as acoustic and isotropic (e.g., Ratcliffe et al., 2011). The acoustic approximation can lead to unreliable amplitudes. However, it has two important advantages:

1. the forward-modeling is less computationally expensive,

2. acoustic FWI is better posed than elastic FWI because only the dominant parameter vp takes part in the inversion.

Successful applications of acoustic FWI to synthetic elastic data have been presented by Brossier et al. (2009) and Warner et al. (2012). Nevertheless, further complexity can be added in the wave-propagation modeling by including shear waves (elastic propagation), different degrees of anisotropy, mainly vertical transverse isotropy (VTI) or tilted transverse isotropy (TTI), and attenuation (visco-elasticity). See, e.g., Gholami et al. (2013) for a study of optimal parameters for VTI acoustic FWI. In general, the correctness of the chosen approximation should be validated by comparison with the observed data. The information carried by the predicted seismogram is exploited by means of a misfit functional. The misfit functional is a measure of the distance between the predicted data, d_pre, and the observed data, d_obs. A common choice for the misfit functional is the distance in the L2 norm:

C(m) = √( Σ_t Σ_x (d_obs − d_pre(m))² ),

i.e., the square root of the sum of the squares of the differences of the data sets. The L2 norm is popular because it is easy to implement and it is elegant from a theoretical point of view. Nevertheless, its drawback is that it lacks robustness, i.e., a few mispositioned data points (outliers) in the data set may compromise the final result.

Another popular norm used to form the misfit functional is the L1 norm:

C(m) = Σ_t Σ_x |d_obs − d_pre(m)|.

The L1 norm is more reliable than the L2 norm in the case that outliers affect the data (Claerbout and Muir, 1973). Some authors proposed hybrid L1/L2 norms, such as the Huber norm (Guitton and Symes, 2003). Other possible misfit functionals are the measure of cross-correlation time shift between observed and predicted waveforms (Luo and Schuster, 1991), measurements of time-frequency misfits (Fichtner et al., 2008), and the instantaneous phase and envelope (Bozdağ et al., 2011). No common opinion


exists on whether one norm is better than another, and the best norm to use is frequently problem-dependent. In addition, appropriate pre-processing, such as muting, trace-by-trace normalization, time and offset gain, and filtering, may strongly improve the results.
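The contrast between the two norms can be illustrated with a small numerical experiment (my own sketch; the random "observed" and "predicted" arrays and the outlier amplitude are arbitrary):

```python
import numpy as np

def misfit_l2(d_obs, d_pre):
    """Square root of the sum of squared differences (L2 misfit)."""
    return np.sqrt(np.sum((d_obs - d_pre) ** 2))

def misfit_l1(d_obs, d_pre):
    """Sum of absolute differences (L1 misfit)."""
    return np.sum(np.abs(d_obs - d_pre))

rng = np.random.default_rng(0)
d_obs = rng.standard_normal((500, 48))          # time samples x receivers
d_pre = d_obs + 0.01 * rng.standard_normal((500, 48))

# A single large outlier inflates the L2 misfit far more than the L1
# misfit, which is the robustness argument made above.
d_out = d_pre.copy()
d_out[0, 0] += 100.0
l2_ratio = misfit_l2(d_obs, d_out) / misfit_l2(d_obs, d_pre)
l1_ratio = misfit_l1(d_obs, d_out) / misfit_l1(d_obs, d_pre)
```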

The information carried by the misfit functional is employed to iteratively improve the matching between the observed and predicted data sets. The iterative procedure starts from a best-guess model, which may be obtained, e.g., using travel-time tomography (see Chapter 1). In the following paragraphs, I will refer to this best-guess model as the “starting model”, and I will denote it with m_0. The starting model is improved using a sequence of linearized gradient-based local inversions. At each iteration an update of the model, ∆m, is computed. This update consists of a descent direction (the versor) in the model space, h = ∆m/|∆m|, and a step length (the amplitude), λ = |∆m|. That is, ∆m = hλ. Several popular minimization schemes exist; they can be divided into two categories: (i) methods that require only the computation of the first derivatives, and (ii) methods that require the computation of the second derivatives (the Hessian matrix). Methods that fall in the first group are less accurate but computationally simpler. In this group, the steepest-descent (Debye, 1909) and conjugate-gradient (Hestenes and Stiefel, 1952) methods are two of the most popular. Methods that fall in the second group are more accurate but also more computationally intensive. They comprise Newton's method and its variants, such as the popular Gauss-Newton method (examples of applications of Newton's method in geophysics are in Santosa and Symes (1988) and Pratt et al. (1998)). For a number of problems the computation of the Hessian matrix is impracticable, and the gradient-based methods must be chosen. Given a minimization scheme, be it gradient-based or Hessian-based, the iterative procedure is repeated until the synthetic data explain the observed data sufficiently well, or until a certain termination criterion is met.

In the paragraphs below, I briefly introduce the steepest-descent method and the conjugate-gradient method. In both methods, the gradient of the objective function is generally computed with the adjoint-state method (Sei and Symes, 1994), which allows one to avoid the explicit computation of the sensitivity matrix. These two methods are employed in the experimentations described in Chapter 6.


3.2 The steepest-descent method

The steepest-descent method finds the descent direction, h_0, that improves the starting model, m_0, as much as possible already in the first iteration, given a suitable step length, λ_0. Hence, the direction h_0 to find is the one that leads to the maximum decrease of the misfit functional, C(m), i.e., the one that minimizes

C(m_0 + λ_0 h_0) − C(m_0) ≈ λ_0 h_0 · ∇C_m(m_0),

where ∇C_m(m_0) = ∂C(m_0)/∂m is the gradient of the misfit functional (∇C_m(m_0) is a vector of dimension M, where M is the dimension of the model space, that contains the perturbation amplitudes of the misfit associated with a small perturbation of each model parameter). This yields the inequality:

λ_0 h_0 · ∇C_m(m_0) ≥ −λ_0 ‖∇C_m(m_0)‖₂ ‖h_0‖₂.    (3.1)

Note that, on the right side of the equation, the term ‖h_0‖₂ can be dropped because it is equal to 1. Thus, the maximum descent corresponds to the direction h_0 for which the equal sign holds in Equation 3.1. This is the case for

h_0 = −∇C_m(m_0) / ‖∇C_m(m_0)‖₂,

which is the direction of steepest descent. The model perturbation is chosen in the direction opposite to the steepest ascent (i.e., the gradient) of the misfit function at the point m_0. The steepest-descent procedure can be summarized as follows:

1. Choose an initial model m_0 and set i = 0.

2. Compute the gradient for the current model, ∇C_m(m_i).

3. Update m_i according to m_{i+1} = m_i − λ_i ∇C_m(m_i), with a suitable step length that ensures C(m_{i+1}) < C(m_i).

4. Set i → i + 1, go to Step 2 and repeat until the data are explained sufficiently well.

Despite being conceptually simple, the steepest-descent method tends to converge rather slowly towards an acceptable model. This is because a succession of descent directions that are locally optimal may not necessarily be optimal from a global perspective.
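Steps 1-4 can be sketched on a toy quadratic misfit C(m) = ½ mᵀA m − bᵀm, whose gradient is A m − b (my own example; the 2×2 matrix, the exact line search and the iteration budget are illustrative choices, not part of an FWI code):

```python
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])            # symmetric positive definite "Hessian"
b = np.array([1.0, 2.0])
grad = lambda m: A @ m - b            # gradient of C at m

m = np.zeros(2)                       # step 1: initial model m0, i = 0
for _ in range(200):                  # steps 2-4, with a fixed iteration budget
    g = grad(m)                       # step 2: gradient at the current model
    if g @ g < 1e-30:                 # converged to machine precision
        break
    step = (g @ g) / (g @ (A @ g))    # exact line search (quadratic case)
    m = m - step * g                  # step 3: move against the gradient

m_exact = np.linalg.solve(A, b)       # the minimum of C, for comparison
```

For a quadratic misfit the iterates approach the minimizer A⁻¹b, but only linearly: each locally optimal step partly undoes the previous one, which is the slow zig-zag behavior mentioned above.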


3.3 The conjugate-gradient method

The conjugate-gradient method can be elegantly formalized for purely quadratic functionals with a positive definite Hessian. In such a case, the conjugate-gradient method is a direct matrix solver that converges to the exact solution after at most M iterations (see, e.g., Fichtner, 2011, for more details). In brief, first, some preferred directions called conjugate directions are determined, which constitute a basis of orthogonal descent directions; next, the model is updated along these directions using an adequate step length.

In the general case of a non-linear misfit functional, the orthogonality relations do not hold and the algorithm may need more than M iterations to reach an acceptable result. Here, the model is updated at the n-th iteration along the direction p^(n), which is a linear combination of the gradient at the n-th iteration, ∇C^(n), and the previous direction p^(n−1):

p^(n) = ∇C^(n) + β^(n) p^(n−1),

where β^(n) is a real-valued scalar designed to guarantee that p^(n) and p^(n−1) are conjugate. Among the different variants of the conjugate-gradient method, the Polak-Ribière formula (Polak and Ribière, 1969) is generally used in FWI to derive the expression of β^(n):

β^(n) = ⟨∇C^(n) − ∇C^(n−1) | ∇C^(n)⟩ / ‖∇C^(n−1)‖².

Note that only three vectors of dimension M, i.e., ∇C^(n), ∇C^(n−1), and p^(n−1), are required to implement the conjugate-gradient method. In FWI, the preconditioned gradient W_m⁻¹ ∇C^(n) is used in place of the plain gradient in p^(n), where W_m is a weighting/preconditioning operator. The aim of the preconditioning operator is to bring the Hessian matrix closer to the unit matrix, because this speeds up the convergence toward the best model.

The conjugate-gradient method has been one of the most popular local optimization algorithms for solving FWI problems in the last decades (Crase et al., 1990; Mora, 1987; Tarantola, 1987). Compared to the steepest descent, the conjugate-gradient method converges faster, and the improved convergence comes at nearly no additional computational cost (Fichtner, 2011).
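The quadratic case can be checked directly (my own sketch; the 3×3 matrix A and vector b are arbitrary SPD test data, and in the exactly quadratic case the Polak-Ribière β reduces to the classical conjugate-gradient value):

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])       # positive definite Hessian of C
b = np.array([1.0, 2.0, 3.0])
grad = lambda m: A @ m - b            # gradient of C(m) = 0.5 m^T A m - b^T m

m = np.zeros(3)
g_old = grad(m)
p = -g_old                            # first direction: steepest descent
for _ in range(3):                    # at most M = 3 iterations
    step = -(g_old @ p) / (p @ (A @ p))                # exact line search
    m = m + step * p
    g_new = grad(m)
    beta = (g_new - g_old) @ g_new / (g_old @ g_old)   # Polak-Ribiere
    p = -g_new + beta * p             # next conjugate direction
    g_old = g_new

# m now equals the exact minimizer of C, i.e., the solution of A m = b,
# to machine precision: convergence in at most M = 3 iterations.
```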

3.4 Time domain vs. frequency domain

Historically, the first attempts to tackle FWI were performed in the time domain (Gauthier et al., 1986; Mora, 1987; Tarantola, 1984), and the frequency-domain approach was proposed mainly in the 1990s by Pratt and collaborators (Pratt, 1990; Pratt and Goulty, 1991; Pratt and Worthington, 1990), first with applications to cross-hole data and later with applications to wide-aperture surface seismic data (Pratt et al., 1996). Which one of the two implementations is best is a question that is still debated. The frequency domain is a more natural framework for the multi-scale approach (see next paragraph), in which successive inversions of increasing frequencies are performed. Vice versa, the time domain is the most appropriate when a time windowing of the data is used to improve the inversion, or if other time-dependent operations are required (e.g., time gain, trace-by-trace normalization, envelope computation, etc.) to improve the inversion procedure. Of note, time windowing allows the extraction of specific arrivals for FWI (early arrivals, reflections, PS converted waves), and this feature is often useful to mitigate the nonlinearity of the inversion (Brossier et al., 2009; Morgan et al., 2013; Sears et al., 2008).

3.5 Model wave-numbers and data frequencies

In this paragraph I introduce some topics which are related to each other: (i) the roughness of the starting model, (ii) the cycle-skipping artifacts, (iii) the roughness of the misfit surface and (iv) the multi-scale approach.

As mentioned in Chapter 1, a good starting model for gradient-based FWI is smooth and does not contain sharp discontinuities. In other terms, it contains the low wave-numbers of the subsurface. Broadly speaking, one needs the low frequency content of the seismic data, i.e., < 4 Hz, which is usually missing in exploration seismic data, to invert for the low wave-numbers of the model. Whenever it is possible, this lack of information is retrieved elsewhere (a priori information, kinematics) and inserted in the starting model.

Sirgue (2006) stressed that the presence of low frequencies in the field data is normally essential for a robust and effective inversion. Note that the starting model must supply the low wave-number information that cannot be inverted from the data; in other words, there is a trade-off between the low-frequency content of the data and the required accuracy of the starting model.

The cycle-skipping artifact is the phenomenon by which the predicted data match the wrong cycle of the observed data. It can occur if the starting model is not accurate enough. According to the Born approximation, the starting model must match the observed travel times with an error of less than half a period (Beydoun and Tarantola, 1988); otherwise, cycle skipping will take place (see Figure 3.1) and the algorithm will converge toward a local minimum. A general rule is that low frequencies allow a less accurate match than high frequencies. This suggests starting the inversion


Figure 3.1: Scheme of cycle-skipping artifacts in gradient-based FWI. The solid black line represents a monochromatic seismogram of period T as a function of time. The upper dashed line represents the predicted monochromatic seismogram with a time delay greater than T/2. In this case, gradient-based FWI will eventually match the (n + 1)-th cycle of the predicted seismogram with the n-th cycle of the observed seismogram, leading to an erroneous model. In the bottom example, FWI will update the model correctly, because the time delay is less than T/2. (Image taken from Virieux and Operto, 2009.)

using only the low frequencies and progressively incorporating higher frequencies at each iteration. This approach is called the multi-scale approach (Bunks et al., 1995) and is illustrated in Figure 3.2. It (i) mitigates the nonlinearity and ill-posedness of the inversion, (ii) promotes convergence toward physically meaningful models, and (iii) helps avoid cycle-skipping artifacts.

The benefits of the multi-scale approach can also be understood by observing the shape of the misfit functional. First, note that cycle-skipping artifacts are due to the inversion falling into a local minimum of the misfit functional. Second, note that the complexity of the misfit functional is directly related to the dominant length scale of the Earth model: rough Earth models generate rough misfit functionals with several local minima, whereas the misfit functionals corresponding to smooth models tend to be smooth, with fewer local minima. The multi-scale approach therefore starts the inversion with long-period data to constrain the long-wavelength structure, and then inverts for shorter-wavelength structures using shorter-period data. Note that this strategy greatly reduces the requirements on the initial model, simply by widening the valley of the global minimum.
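The half-period criterion and a frequency-continuation schedule can be sketched in a few lines. The helper names are hypothetical, and the geometric spacing of the bands is only one possible choice:

```python
import numpy as np

def max_allowed_delay(freq):
    """Half-period criterion: the starting model must predict travel
    times to within T/2 = 1 / (2 f) to avoid cycle skipping."""
    return 1.0 / (2.0 * freq)

def multiscale_schedule(f_min, f_max, n_stages):
    """Frequency bands for a simple multi-scale continuation,
    spaced geometrically from f_min to f_max."""
    return np.geomspace(f_min, f_max, n_stages)
```

For example, at 4 Hz the tolerated travel-time error is 0.125 s, while at 30 Hz it shrinks to about 17 ms, which is why the late stages can rely on the accuracy inherited from the earlier ones.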



Figure 3.2: Illustration of the multi-scale approach for the case of a three-stage inversion, taken from Fichtner (2011). Top: the first stage of the inversion uses only the long-period data. The inversion starts from the initial model m0 (empty circle) and iteratively moves toward the best long-wavelength model (filled circle). Note that the basin of attraction (shaded area) is large. Centre: the best model from the first stage is used as the initial model in the second stage, which uses intermediate-period data. The misfit functional is rougher and the basin of attraction around the optimum is narrower, but this is balanced by the fact that the initial model is more accurate. Bottom: the best model from the second stage is used as the initial model in the third stage, where short-period data are used to retrieve the short-wavelength components of the model. The basin of attraction around the global minimum is particularly narrow. As the inversion proceeds from one stage to the next, increasingly many details appear in the tomographic image.

3.6 Ill-conditioning

Another important topic in full-waveform inversion is the condition number. The condition number of a function with respect to an argument measures how much the output of the function can change for a small change in the input. In this context, the argument is the geological model, the function is the forward problem, and the output is the seismic data set. Low condition numbers mean low sensitivity of the output to input changes, and vice versa. Problems with a low condition number are said to be well-conditioned, and problems with a high condition number are said to be ill-conditioned. Geophysical problems are typically ill-conditioned. Ill-conditioning is partially due to the acquisition geometry (limited aperture, limited angle) and the type of signals recorded (reflected, refracted, transmitted).
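A toy numerical illustration of conditioning, assuming a linearized forward operator G mapping model to data (the matrices below are invented for illustration, not derived from any seismic experiment):

```python
import numpy as np

# Two independent measurements: the data d = G m constrain the
# model m tightly, and small data errors stay small in the model.
G_well = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

# Two nearly redundant measurements, mimicking two receivers that
# sample almost the same wavepath: the rows are nearly collinear,
# so tiny data errors are hugely amplified in the model estimate.
G_ill = np.array([[1.0, 1.0],
                  [1.0, 1.0001]])

print(np.linalg.cond(G_well))  # 1.0: well-conditioned
print(np.linalg.cond(G_ill))   # ~4e4: ill-conditioned
```

The same intuition carries over to FWI: limited-aperture geometries yield nearly redundant wavepaths and hence a nearly singular, ill-conditioned sensitivity operator.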


From a historical point of view, the first applications of FWI were limited to seismic reflection data, and they failed to reconstruct the intermediate wavelengths of the model structure. This is because, for short-offset acquisitions, the seismic wavefield is almost insensitive to intermediate wavelengths. FWI has been regarded as an efficient seismic imaging technique since the late 1980s and the 1990s, when Mora (1987, 1988), Pratt and Worthington (1990), Pratt et al. (1996), and Pratt (1999) demonstrated the benefit of wide-angle/long-offset refracted and transmitted data for reconstructing the large and intermediate wavelengths of the subsurface structure with FWI.

Figure 3.3 illustrates how the resolution of the inversion is strongly affected by the aperture of the data. In the figure, three different experimental setups are shown: (i) cross-hole, (ii) double cross-hole, and (iii) surface acquisition. In the cross-hole experiment, an array of sources is placed at the surface and an array of receivers is placed at depth. In the double cross-hole experiment, the data generated by a vertical array of sources on the left and recorded by a vertical array of receivers on the right are added to the data of the cross-hole experiment. In the surface acquisition, both sources and receivers are placed at the surface. The true model is an inclusion in a homogeneous background.

In the cross-hole experiment, the imaging is anisotropic: the vertical wave-numbers are reconstructed by means of transmission-like wavepaths, whereas the horizontal wave-numbers are reconstructed by means of reflection-like wavepaths. In the double cross-hole experiment, both reflection-like and transmission-like wavepaths contribute to reconstructing the vertical and horizontal wave-number spectra of the inclusion; consequently, a better reconstruction is achieved, with comparable resolution in the horizontal and vertical directions. Unlike the two previous geometries, the surface acquisition relies on a shorter range of propagation angles and lacks large-aperture illumination: no transmission-like wavepaths are recorded, and the information is limited to reflection-like wavepaths only. Consequently, both the vertical and the horizontal sections exhibit a strong deficiency in the reconstruction. In particular, the vertical section lacks low wave-number content, and the horizontal section of the inclusion is poorly recovered because the horizontal wave-numbers are poorly illuminated from the surface.
This comparison stresses that wide apertures in offset are required to resolve the large wavelengths of the medium and, equivalently, to design well-conditioned FWI problems.

The quest for a broad aperture in azimuth pushed for the introduction of three-dimensional acquisition geometries. Wu and Toksöz (1987) and Sirgue and Pratt (2004) showed that both the temporal frequency and the aperture angle control the spatial resolution of the imaging; therefore, the broader the range of aperture angles in the acquisition geometry, the broader and more continuous the spectrum of wave-numbers resolved by the seismic imaging.
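Assuming the homogeneous-background relation of Sirgue and Pratt (2004), in which the wavenumber imaged at frequency f and aperture (scattering) angle θ in a medium of velocity c is k = (4πf/c) cos(θ/2), the frequency–aperture trade-off can be sketched as follows (variable names are mine):

```python
import numpy as np

def imaged_wavenumber(freq, c, theta):
    """Wavenumber (rad/m) imaged at frequency `freq` (Hz) and aperture
    angle `theta` (radians) in a homogeneous medium of velocity c (m/s):
    k = (4 pi f / c) * cos(theta / 2).
    Sketch based on Sirgue and Pratt (2004); hedged, not a full treatment."""
    return 4.0 * np.pi * freq / c * np.cos(theta / 2.0)

# Wide apertures (theta near pi) map a given frequency onto low
# wavenumbers; zero aperture yields the highest wavenumber available.
k_wide = imaged_wavenumber(5.0, 2000.0, 0.9 * np.pi)
k_zero = imaged_wavenumber(5.0, 2000.0, 0.0)
```

This makes the statement in the text quantitative: a single low frequency recorded over a broad range of apertures already spans a continuous band of wavenumbers, which is the basis of efficient frequency-decimated FWI.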
