1 Introduction

Full waveform inversion (FWI) is a computationally intensive technique partly due to its requirement of repeated wave simulations (Virieux and Operto 2009). Source encoding and random shot selection are two commonly employed strategies to decrease the computational cost of FWI. Source encoding can be implemented either during acquisition (Neelamani and Krohn 2008), by simultaneously injecting multiple sources with varying source signatures, or during migration (Romero et al. 2000; Perrone and Sava 2009; Tu and Herrmann 2012) and FWI (Krebs et al. 2009; Van Leeuwen et al. 2011; Matharu and Sacchi 2018; Rao and Wang 2017), by convolving sequentially fired sources with randomly generated encoding sequences. The encoded sources act as a single effective spatially extended source, thereby reducing the number of wavefield simulations required.

Random shot selection involves approximating the misfit and the gradient by a randomly chosen subset of shots at each iteration (Diaz and Guitton 2011; Van Leeuwen and Herrmann 2013; Ouellet et al. 2017; Yang et al. 2018). The subset of selected shots constitutes a ‘mini-batch’ (Yang et al. 2018), borrowing from machine learning terminology. As prescribed by the theory of stochastic optimization (Robbins and Monro 1951), the gradients are averaged over a certain iteration history to build the search direction. Although stochastic optimization algorithms exhibit a rapid reduction in the misfit function in the initial iterations, gradient averaging leads to slower convergence in the later iterations. Yang et al. (2018) hasten the convergence by preconditioning the gradient with structure-oriented filters (Hale 2011) computed using migrated images. Van Leeuwen and Herrmann (2013) prescribed a gradual increase in the mini-batch size to benefit from the fast convergence associated with stochastic and conventional optimization algorithms in the initial and later iterations, respectively. They calculate the model updates using an l-BFGS-based (Nocedal and Wright 1999) optimization scheme introduced by Friedlander and Schmidt (2012). However, the l-BFGS method needs to be suitably modified for application in the stochastic setting. The proposed modifications involve uncoupling the curvature and gradient estimates by updating the inverse Hessian on a different schedule than the gradients (Byrd et al. 2014) or by updating the inverse Hessian using gradients from the same set of sources in successive iterations (Ouellet et al. 2017). Matharu and Sacchi (2019) introduced a stochastic truncated Newton algorithm, wherein the Hessian-vector product is formed using a non-uniformly sampled set of sources and the second-order adjoint-state method (Fichtner and Trampert 2011; Metivier et al. 2012).

The deep learning era has spawned the development of a variety of algorithmic extensions to stochastic gradient descent methods (Goodfellow et al. 2016). Richardson (2018) and Sun et al. (2020) showed that acoustic wave propagation can be simulated using a recurrent neural network (RNN), leading to the formulation of FWI as the training of the RNN. They compared several optimization algorithms for training the RNN and found that Adam (Kingma and Ba 2014), a stochastic gradient based algorithm with adaptive moment estimation, converges rapidly and yields stable results. Here, we apply the Adam optimization algorithm to FWI with random shot selection, albeit without the need to simulate wave propagation as an RNN. We first review the methodology of stochastic FWI with mini-batches, and then discuss certain empirical criteria for the selection of hyperparameters with the aid of numerical examples. We then illustrate the performance of the algorithm on synthetic data generated from the Marmousi model. Finally, we compare the performance of the algorithm with conventional FWI, illustrate the dependence of FWI on the initial model and the frequency content of the data, and discuss the potential implementation of the presented algorithm to field data. Codes to reproduce the examples presented in this manuscript will be made publicly available.

2 Stochastic full waveform inversion with mini-batches

The data-misfit function for full waveform inversion with random shot selection can be defined as:

$$\begin{aligned} {\mathcal{J}}_{B}({\theta }) = \frac{1}{2B} \sum _{i \in \Omega _{B}} \sum _{j=1}^{N_{r}} \int _{0}^{T} \left\| d^{\mathrm {obs}}\left( x^{i}_{s}, x^{j}_{r}, t\right) - d^{\mathrm {sim}}\left( x^{i}_{s}, x^{j}_{r}, t, {\theta }\right) \right\| ^{2}_{2} \, {\mathrm {d}}t, \end{aligned}$$
(1)

where \({\theta }\) represents a suitably discretized subsurface model, \(d^{\mathrm {obs}}\) and \(d^{\mathrm {sim}}\) denote the observed and simulated data in the time domain, and \(T\) denotes the total recording time. The coordinate of the ith source in a randomly chosen subset of sources \(\Omega _{B}\) is denoted by \(x^{i}_{s}\). The number of sources in the subset is denoted by B, and it may be chosen to be less than or equal to the total number of sources, denoted by \(N_{s}\). The coordinates of the jth receiver are denoted by \(x^{j}_{r}\), and a fixed receiver spread of \(N_{r}\) receivers is assumed for all the shots. A fixed receiver spread is not a necessary condition; it is used here for convenience. Additionally, we assume a constant-density acoustic wave equation to model and invert seismic data. Henceforth, the model parameter \({\theta }\) will refer to the P-wave velocity model.
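As an illustration, a minimal numpy sketch of the mini-batch misfit in equation (1) is given below, assuming the observed and simulated data are held in memory as dense arrays indexed by source, receiver, and time sample; the array names and the approximation of the time integral by a simple sum are illustrative choices rather than part of the actual implementation.

```python
import numpy as np

def minibatch_misfit(d_obs, d_sim, batch, dt):
    """Mini-batch data misfit of equation (1).

    d_obs, d_sim : arrays of shape (n_sources, n_receivers, n_time)
    batch        : indices of the B randomly selected shots (Omega_B)
    dt           : time sampling interval, used to approximate the integral
    """
    B = len(batch)
    residual = d_obs[batch] - d_sim[batch]
    return 0.5 / B * np.sum(residual ** 2) * dt
```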

Full waveform inversion with mini-batches proceeds by successively updating the model along a search direction:

$$\begin{aligned} {\theta }_{k+1} = {\theta }_{k} + \alpha _{k} \, {\mathbf{s}}_{k}, \end{aligned}$$
(2)

where \(\alpha _{k}\) is a suitable step-length and \({\mathbf{s}}_{k}\) is the search direction. The subscripts in equation (2) denote the iteration number. In the stochastic gradient descent scheme, the search direction is chosen as the negative of the gradient, i.e., \({\mathbf{s}}_{k} = - \partial {\mathcal{J}}_{B} / \partial {\theta }_{k}\). The subset of shots chosen to form the misfit function \({\mathcal{J}}_{B}\) and the gradient varies with each iteration. Hence, the gradient at each iteration is a noisy estimate of the true gradient and needs to be smoothed across iterations (Van Leeuwen and Herrmann 2013). The gradient may also be suitably preconditioned, for instance, by structure-oriented filters (Hale 2011) computed using migrated images (Yang et al. 2018). Alternatively, second-order optimization methods suitably modified for the stochastic optimization setting can also be used to scale the gradient by an estimate of the inverse Hessian (Van Leeuwen and Herrmann 2013; Ouellet et al. 2017).

Here, we use the Adam (Kingma and Ba 2014) optimization algorithm to compute the updates to the velocity model. The algorithm builds the search direction from first- and second-order moments of the gradient vectors, weighted exponentially over the history of iterations, which stabilize the direction of the update and correct for the curvature of the loss function, respectively. For the sake of clarity, the pseudo-code for the Adam optimization algorithm taken from Kingma and Ba (2014) is reproduced with minor modifications in the Appendix. Acoustic wave propagation is simulated with the finite-difference method with absorbing boundaries (including the top) using ‘Devito’, a domain-specific language (DSL) and code generation framework (Luporini et al. 2018; Louboutin et al. 2019). The gradients are computed using the adjoint-state method (Plessix 2006) implemented within the Devito package, and the updates to the velocity model according to the Adam optimization algorithm (Kingma and Ba 2014) are calculated using the Tensorflow package (Abadi et al. 2015).
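For reference, a minimal numpy sketch of a single Adam update applied to the velocity model is given below, following the pseudo-code of Kingma and Ba (2014); the function and variable names are illustrative, the default hyperparameter values correspond to the choices discussed in section 3, and the actual experiments use the Tensorflow implementation.

```python
import numpy as np

def adam_step(theta, grad, m, v, k, alpha=0.01, beta1=0.9, beta2=0.9, eps=1e-8):
    """One Adam iteration (Kingma and Ba 2014) on the velocity model theta.

    grad : mini-batch gradient dJ_B/dtheta from the adjoint-state method
    m, v : running first- and second-order moment estimates of the gradient
    k    : iteration counter, starting at 1
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** k)              # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** k)              # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```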

3 Numerical examples

We illustrate stochastic optimization strategies on a constant-density acoustic model modified from the Marmousi model (Versteeg 1994), displayed in figure 1(a). The data are modelled with 32 evenly spaced sources, each with a Ricker wavelet source signature with a peak frequency of 10 Hz (figure 1b). The receivers, placed 30 m below the surface at an interval of 15 m in the x direction, span the lateral extent of the model. The receiver spread is fixed for all the sources, although this is not a necessary condition for FWI with random shot selection. Figure 1(d) shows a shot record modelled from a source placed at the center of the true velocity model. Independent realizations of band-limited (5–50 Hz) random noise are added to the modelled data to simulate noisy data (figure 1e). The initial model used in the inversion, displayed in figure 2(a), is obtained by smoothing the true model with a 50-point isotropic Gaussian filter. The shot record for the source at the center of the initial model and the corresponding data residuals are shown in figure 2(b and c), respectively.
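For completeness, a short sketch of the 10 Hz Ricker source signature used in figure 1(b) is given below; the time step and the wavelet length are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

def ricker(f_peak, dt, t_shift=None):
    """Ricker wavelet with peak frequency f_peak (Hz), sampled at dt (s)."""
    t_shift = 1.5 / f_peak if t_shift is None else t_shift
    t = np.arange(int(2 * t_shift / dt) + 1) * dt - t_shift
    arg = (np.pi * f_peak * t) ** 2
    return (1.0 - 2.0 * arg) * np.exp(-arg)

wavelet = ricker(f_peak=10.0, dt=0.002)   # 10 Hz wavelet, 2 ms sampling (assumed)
```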

Figure 1

Synthetic model used to generate observed data: (a) True P-wave velocity model modified from the Marmousi model. The source placed at the center of the model is denoted by a red star and the receivers are denoted by a black line. Ricker wavelet centred at 10 Hz plotted in the (b) time and (c) frequency domain. (d) Modelled data from a shot placed at the center of the true model with the source signature shown in (b). (e) Noisy shot record simulated by adding band-limited noise to (d) such that the signal-to-noise ratio is 5.0.

Figure 2

(a) Initial P-wave velocity model used in the inversion, obtained by applying a 50-point isotropic Gaussian filter to the true model. The source placed at the center of the model is denoted by a red star and the receivers are denoted by a black line. (b) Modelled data from the shot placed at the center of the initial model with the source signature shown in figure 1(b). (c) Initial data residuals obtained by subtracting the data shown in (b) from the data displayed in figure 1(d).

3.1 Selection of hyperparameters

The hyperparameters in the Adam optimization algorithm, namely the learning rate \(\alpha \), the mini-batch size, \(\beta _{1}\), and \(\beta _{2}\), need to be set to values suitable for FWI. The symbols for the various hyperparameters are consistent with the notation used in Kingma and Ba (2014) for easy reference. The parameters \(\beta _{1}\) and \(\beta _{2}\) control the exponential weights in the smoothed estimates of the first- and second-order moments of the gradient vector computed over past iterations. The smoothing weights decrease by approximately a factor of e over \(1/(1-\beta )\) iterations. Thus, the suggested values \(\beta _{1}=0.9\) and \(\beta _{2}=0.999\) (Kingma and Ba 2014) lead to effective smoothing windows of ~10 and 1000 iterations, respectively. A larger window is required for the second-order moment estimate in sparse settings (Kingma and Ba 2014). Most FWI applications do not involve a large number of iterations; furthermore, the FWI gradient is typically not sparse. Therefore, we set \(\beta _{1} = \beta _{2} = 0.9\), effectively averaging over 10 iterations to estimate the first- and second-order gradient moments. We found that the results were fairly robust for values of \(\beta _{1}\) and \(\beta _{2}\) between 0.8 and 0.95. The parameter \(\epsilon \) stabilizes the division in the model update (Kingma and Ba 2014) and is left at its default value of \(10^{-8}\).

3.2 Learning rate

The performance of the Adam optimization algorithm is sensitive to the value of the learning rate \(\alpha \). The learning rate is usually found by performing a grid search with cross-validation over a range of possible values (Goodfellow et al. 2016; Richardson 2018). We choose to optimize with a constant learning rate, although learning rate decay may be used in stochastic gradient descent methods to improve convergence (Goodfellow et al. 2016). Sun et al. (2020) adapt the value of the learning rate by requiring the velocity model updates to be of the order of 10 m/s, since the magnitude of the difference between the initial and true velocity models may be of the order of 0–1000 m/s. Hence, they search for the optimal learning rate in the interval [10, 100].

Here, we search for the optimal learning rate in the interval [\(10^{-4}\), 1.0], since we use km/s as the unit of the velocity models. We follow the strategy described in Smith (2017) to choose the learning rate for algorithms based on stochastic gradient descent. Although empirical, the criterion presented by Smith (2017) has proven effective in finding optimal learning rates in a variety of deep learning applications (Howard and Gugger 2020). The criterion involves training the model for a few iterations with an exponentially increasing learning rate and choosing a value smaller than the one at which the data misfit attains its minimum, within the region where the misfit decreases rapidly. Figure 3 displays the data-misfit value (equation 1) for varying mini-batch sizes as a function of the exponentially increasing learning rate over 20 iterations of the optimization algorithm. The data misfit attains a minimum around the learning rate \(\alpha =0.1\); the chosen learning rate should therefore lie in the interval [0.001, 0.1), where the data misfit decreases rapidly. In order to illustrate the effect of the learning rate and the criterion discussed above on the choice of the optimal learning rate, we perform full waveform inversion with the Adam optimization algorithm and varying mini-batch sizes. Figure 4 plots the data misfit (equation 1) as a function of ‘epochs’ (an epoch is defined as a single pass over all the sources) for mini-batch sizes of 4, 8, 16, and 32 and varying learning rates. The inversion diverges for the learning rate \(\alpha =0.1\) for all the mini-batch sizes, as the data misfit grows with epochs due to large model updates (equation 2). Although a learning rate of \(\alpha =0.001\) leads to a monotonic decrease in the misfit function, the misfit function decreases most rapidly for \(\alpha =0.01\). Indeed, \(\alpha =0.01\) roughly corresponds to the inflection point in figure 3, supporting the empirical criterion suggested by Howard and Gugger (2020). Hence, we choose a learning rate of 0.01 for all the mini-batch sizes considered in the experiments below.
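A minimal sketch of this learning-rate range test is given below; `misfit_and_grad`, `sample_batch`, and `update_step` are hypothetical placeholders for the FWI machinery (the mini-batch misfit, the adjoint-state gradient, and the Adam step), and the number of iterations and the search interval follow the values used here.

```python
import numpy as np

def lr_range_test(theta0, misfit_and_grad, sample_batch, update_step,
                  n_iters=20, lr_min=1e-4, lr_max=1.0):
    """Learning-rate range test in the spirit of Smith (2017).

    The learning rate grows exponentially from lr_min to lr_max over n_iters
    iterations while the mini-batch misfit is recorded; a value below the
    misfit minimum, where the misfit still drops rapidly, is then selected.
    """
    lrs = np.logspace(np.log10(lr_min), np.log10(lr_max), n_iters)
    theta, state, history = theta0.copy(), None, []
    for lr in lrs:
        batch = sample_batch()                       # random subset of shots
        misfit, grad = misfit_and_grad(theta, batch)
        history.append((lr, misfit))
        theta, state = update_step(theta, grad, state, lr)
    return history   # plot misfit vs. learning rate to pick alpha
```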

Figure 3

Data-misfit values for varying mini-batch sizes as a function of exponentially increasing learning rate plotted in the log–log scale. The optimal learning rate should be less than the value at the minimum (\(\alpha \approx 0.1\)) and can be chosen from among the values where the loss function decreases rapidly.

Figure 4

Data-misfit values for varying learning rates plotted in the log–log scale for mini-batch size of (a) 4, (b) 8, (c) 16, and (d) 32.

3.3 Mini-batch size

Friedlander and Schmidt (2012) and Van Leeuwen and Herrmann (2013) considered the error in the expected value of the gradient computed from mini-batches relative to that computed from the full dataset. They found that the error in the gradient is larger for random sampling with replacement (i.e., the full set of shots is available for populating each mini-batch) than without replacement (i.e., each mini-batch is populated by a unique set of shots). Here, we follow the common practice in machine learning (Goodfellow et al. 2016) and form mini-batches by uniform random sampling without replacement within an epoch. All the shots are made available for sampling again after each epoch (a full pass over all 32 shots), as sketched below.
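The following short sketch illustrates this sampling scheme, assuming 32 shots as in the experiments; shot indices are drawn uniformly without replacement within an epoch and the pool is reshuffled at the start of each new epoch.

```python
import numpy as np

def epoch_minibatches(n_sources=32, batch_size=4, rng=None):
    """Yield mini-batches of shot indices sampled without replacement
    within an epoch; every shot is used exactly once per epoch."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(n_sources)
    for start in range(0, n_sources, batch_size):
        yield order[start:start + batch_size]

# Example: two epochs with a mini-batch size of 4
for epoch in range(2):
    for batch in epoch_minibatches(n_sources=32, batch_size=4):
        pass  # compute the mini-batch gradient and apply the Adam update here
```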

Figure 5 plots the data misfit in logarithmic scale as a function of the number of epochs. The data misfit for the experiments with mini-batch sizes of 4 and 8 decreases rapidly compared to that with mini-batch sizes of 16 and 32. The slight increase in the misfit function in later epochs for mini-batch sizes of 4 and 8 can be attributed to a reduction in the extent of noise fit by the inversion. Figure 6 displays the inverted velocity model as a function of epochs for the experiments with varying mini-batch sizes. As suggested by the cross-over of the dashed red and solid black lines in the plot of the misfit function (figure 5), the velocity model from FWI with a mini-batch size of 4 converges quickly towards the true velocity model in the initial epochs, while the inverted velocity model from a mini-batch size of 8 gets closer to the true velocity model in later epochs.

Figure 5

Data-misfit values corresponding to the learning rate \(\alpha =0.01\) plotted in the log–log scale as a function of epochs.

Figure 6

Velocity models retrieved from the Adam algorithm as a function of varying mini-batch size (M.B.) and epochs. The learning rate is fixed at \(\alpha =0.01\).

Figure 7(a) displays the inverted velocity model retrieved after 64 epochs for the algorithm implemented with a mini-batch size of 4. The difference between the true and inverted velocity models is shown in figure 7(b). The inverted velocity model exhibits high resolution and is close to the true model. The largest deviations are observed close to the edges of the model due to low data fold. The shot record and data residuals from the center of the inverted model are displayed in figure 7(c and d). Compared to the initial residuals (figure 2c), an order of magnitude reduction can be observed for the residuals corresponding to the inverted model. Figure 8 displays the inverted velocity model, the deviation from the true model, and the predicted data and residuals from the center of the model retrieved after 64 epochs for the algorithm implemented with a mini-batch size of 8. The result from the inversion with a mini-batch size of 4 is closer to the true velocity model (figure 7b) than that from the inversion with a mini-batch size of 8 (figure 8b). However, since the computation of the simulated data and the gradient for each shot is an ‘embarrassingly parallel’ problem, the computation time required for running 64 epochs with a mini-batch size of 8 is half that of the experiment with a mini-batch size of 4, provided the shots within a mini-batch are simulated in parallel. The nominal computational cost associated with both mini-batch sizes is the same, since both experiments involve the same number of forward and adjoint simulations. Inversion with larger mini-batch sizes requires more iterations to produce velocity models close to the true model, although the computation time per iteration may be low provided a sufficient number of parallel threads is available. However, an increase in the number of iterations leads to an increase in the computational cost due to the increase in the number of forward and adjoint simulations. Indeed, McCandlish et al. (2018) argue that the trade-off between computational time and cost in training deep neural networks to achieve a specified level of performance takes the form of a Pareto frontier.

Figure 7

Inversion results after 64 epochs and a mini-batch size of 4: (a) inverted velocity model, (b) difference between the true (figure 1) and inverted model, (c) shot record from the center of the inverted model, and (d) the corresponding data residuals. The learning rate is fixed at \(\alpha =0.01\).

Figure 8

Inversion results after 64 epochs and a mini-batch size of 8: (a) inverted velocity model, (b) difference between the true (figure 1) and inverted model, (c) shot record from the center of the inverted model, and (d) the corresponding data residuals. The learning rate is fixed at \(\alpha =0.01\).

4 Discussion

The numerical experiments outlined in the previous section illustrate the efficacy of the Adam optimization algorithm in FWI with random shot selection. In this section, we compare the results with those from conventional FWI and illustrate the ‘cycle skipping’ problem through the dependence of the algorithm on the initial model and the frequency content of the data. Finally, we discuss the implementation of the algorithm on field data.

Figure 9

Inversion results with conventional FWI implemented with the l-BFGS algorithm. The initial model is the same as that in figure 2(a) and 32 shots modelled without added noise are inverted. (a) Data misfit as a function of misfit function evaluations. (b) Inverted velocity model obtained after 64 function evaluations. Data residuals for a shot placed at the center of the model corresponding to (c) the initial model, and (d) the inverted model from (b).

4.1 Comparison with conventional FWI

We now illustrate the performance of conventional FWI performed using the l-BFGS method with a full batch of 32 shots without added band-limited noise. The initial model for the inversion is chosen to be the same as that for the Adam algorithm. The l-BFGS algorithm involves a line search to perform the model updates (Nocedal and Wright 1999), and multiple forward and adjoint simulations may be necessary per l-BFGS iteration. In order to compare the results from the Adam and l-BFGS methods at the same computational cost (i.e., number of forward and adjoint simulations), we terminate the l-BFGS algorithm after 64 misfit function evaluations. Figure 9 displays the misfit function, the inverted velocity model, and the initial and final data residuals corresponding to a source placed at the center of the model. It can be observed that the result from the Adam optimization algorithm with random shot selection is far superior to that obtained by conventional FWI. Second-order methods, including the Gauss–Newton and l-BFGS methods, though widely used in FWI (Virieux and Operto 2009), are not regularly employed in training deep neural networks, as they have been found to terminate near saddle points (Goodfellow et al. 2016). However, accounting for the Hessian matrix is crucial in FWI to resolve trade-offs between different parameters encountered in, for instance, elastic FWI (Virieux and Operto 2009). We plan to test the suitability of first-order stochastic gradient-descent algorithms like Adam in multi-parameter FWI problems in future work.
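A hedged sketch of such a baseline using the l-BFGS implementation in SciPy is given below; the quadratic stand-in for the misfit and the model dimensions are placeholders for the actual Devito-based misfit and adjoint-state gradient, and the `maxfun` option caps the number of misfit evaluations at 64 to match the cost of the stochastic experiments.

```python
import numpy as np
from scipy.optimize import minimize

theta_true = np.full(101 * 301, 3.0)      # placeholder velocity model (km/s)
theta_init = np.full(101 * 301, 2.5)      # placeholder initial model (km/s)

def misfit_and_grad(theta):
    """Stand-in for the full-batch misfit (equation 1 over all 32 shots) and
    its adjoint-state gradient; replace with the Devito-based computation."""
    residual = theta - theta_true
    return 0.5 * residual @ residual, residual

# l-BFGS baseline capped at 64 misfit evaluations so that the number of
# forward/adjoint simulations matches 64 epochs of the stochastic scheme.
result = minimize(misfit_and_grad, theta_init, jac=True,
                  method="L-BFGS-B", options={"maxfun": 64})
print(result.nfev, result.fun)
```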

Figure 10

Source signature and data used in the high-frequency experiment: Ricker wavelet centred at 25 Hz plotted in the (a) time and (b) frequency domain. (c) Modelled data from a shot placed at the center of the true model (figure 1a) with the source signature shown in (a).

4.2 Dependence on the initial model and frequency content of the data: Cycle skipping

It is well known that the performance of FWI implemented with the \(l^{2}\) norm data misfit (e.g., equation 1) depends on the choice of the initial model and the bandwidth of the data utilized in the inversion (Bunks et al. 1995; Virieux and Operto 2009; Metivier et al. 2016). The initial model should be chosen such that the predicted data are within half a period of the observed data. FWI with an initial model that violates this requirement converges to a local minimum, predicting incorrect velocity models and an unsatisfactory match with the observed data, a phenomenon termed cycle skipping (Virieux and Operto 2009). To address the issue of cycle skipping, Bunks et al. (1995) proposed a hierarchical strategy that inverts data in narrow frequency bands, starting with lower frequencies and proceeding to higher frequencies. This strategy can be implemented either in the time domain (Bunks et al. 1995) or in the frequency domain (Sirgue and Pratt 2004; Operto et al. 2004).

Since the data misfit (equation 1) is based on the \(l^{2}\) distance between the observed and predicted data, the algorithm presented here also suffers from the cycle skipping problem if the chosen initial model does not predict data within half a cycle of the observed data. In order to illustrate this phenomenon, we simulate noiseless data with a Ricker wavelet centred at 25 Hz (figure 10). As with the earlier experiments, the initial model is chosen to be the one shown in figure 2(a). The velocity model obtained after 64 epochs of the Adam optimization algorithm implemented with a mini-batch size of 4 and a learning rate of 0.01 is shown in figure 11. The initial and final data residuals from the inversion are displayed in figure 12. The inversion result is unsatisfactory due to cycle skipping: as evident from figure 12(c and d), the initial model predicts data more than half a cycle away from the true data. Consequently, the data predicted at the final iteration do not match the phase of the true data. Figure 13(a) displays a velocity model obtained by applying a 20-point isotropic Gaussian smoothing filter to the true velocity model (figure 1a). The corresponding inverted velocity model (figure 13b) is closer to the true velocity model. The initial and final residuals are shown in figure 14; the predicted initial data are within half a cycle of the true data, and the data predicted at the final iteration display a better match with the true data than those in figure 12. However, the velocity model obtained from the inversion of the higher-frequency data (generated by a Ricker wavelet centred at 25 Hz) does not compare well with that obtained from the inversion of the narrowband, lower-frequency data (generated by a Ricker wavelet centred at 10 Hz). Although the initial velocity model does predict the first arrival to within half a cycle of the observed data, later reflection events are not within half a cycle, leading the inversion to converge to a local minimum. This can be mitigated by following a hierarchical inversion strategy wherein narrow bands of data are inverted progressively from lower to higher frequencies. Filtering the data to the bandwidth of 1–20 Hz produced results identical to those presented in the last section (figure 7).
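A minimal sketch of the band-pass filtering step used in such a hierarchical strategy is given below, assuming the shot gathers are held as a dense array with time along the last axis; the filter order and the SciPy-based zero-phase implementation are illustrative choices rather than the exact processing used here.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_shots(data, f_min, f_max, dt, order=4):
    """Zero-phase band-pass filter along the time axis of the shot gathers
    (shape: n_sources x n_receivers x n_time), e.g. 1-20 Hz for the first
    stage of a low-to-high frequency hierarchical inversion."""
    sos = butter(order, [f_min, f_max], btype="bandpass",
                 fs=1.0 / dt, output="sos")
    return sosfiltfilt(sos, data, axis=-1)
```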

Figure 11

Velocity model obtained after inversion with the noiseless data shown in figure 10(c). The initial model used for the inversion is displayed in figure 2(a). The algorithm was terminated after 64 epochs for a mini-batch size of 4 and the learning rate \(\alpha =0.01\).

Figure 12

Data residuals corresponding to (a) the initial model (figure 2a) and (b) the inverted model (figure 11). Traces windowed around the first arrival corresponding to the modelled data from the true (solid-black), initial (gray-dotted), and inverted (red-dashed) velocity models are plotted for (c) \(x=0\) km, and (d) \(x=7.5\) km.

Low-frequency data with a good signal-to-noise ratio are crucial for the success of FWI. Ovcharenko et al. (2019) demonstrated the ability of a deep convolutional neural network to predict data at 0.25 Hz from data between 2 and 4.5 Hz modelled from synthetic benchmark models. Similarly, Sun and Demanet (2020) extrapolated data in the 0.1–5 Hz band from synthetic data in the 5–20 Hz band modelled from the Marmousi model. Dellinger et al. (2016) and Brenders et al. (2020) presented an ultra-low-frequency marine source that can yield robust data in a bandwidth of 1.75–2.5 Hz at long offsets. Alternatively, the cycle skipping problem can be mitigated by defining the misfit based on an optimal transport distance (Metivier et al. 2016, 2018; Yang and Engquist 2019) to convexify the non-convex misfit function based on the \(l^2\) norm. In summary, modifications to the misfit function, the inclusion of low-frequency data, or a hierarchical inversion strategy can be suitably combined with the algorithm presented here to produce accurate subsurface models from field data.

Figure 13

(a) Initial velocity model obtained by applying a 20-point isotropic Gaussian smoothing filter to the true velocity model. (b) Velocity model obtained after inversion with the noiseless data shown in figure 10(c). The algorithm was terminated after 64 epochs for a mini-batch size of 4 and the learning rate \(\alpha =0.01\).

Figure 14

Data residuals corresponding to (a) the initial model (figure 13a) and (b) the inverted model (figure 13b). Traces windowed around the first arrival corresponding to the modelled data from the true (solid-black), initial (gray-dotted), and inverted (red-dashed) velocity models are plotted for (c) \(x=0\) km, and (d) \(x=7.5\) km.

4.3 Towards implementation on field data

While we have presented the performance of the algorithm on the benchmark Marmousi model, future work will focus on implementing the algorithm on a suitable field dataset. Here, we outline a processing flow towards a potential implementation of the algorithm on field data. The source wavelet was assumed to be known exactly in the presented methodology. If the source wavelet is not available or not accurately recorded for field data, it can be estimated following the methodologies presented by Pratt (1999) and Plessix and Cao (2011). The initial velocity model can be estimated by first-arrival traveltime tomography (Jaiswal et al. 2008, 2009; Prieux et al. 2009), by Laplace–Fourier domain inversion (Shin and Cha 2009), or with the application of global optimization methods (Datta and Sen 2016). It should be ensured that the initial velocity model predicts data within half a cycle of the observed data. A hierarchical inversion strategy can then be followed, applying the Adam optimization algorithm with random shot selection as presented above.

5 Conclusions

We presented a strategy for full waveform inversion with random shot selection based on Adam, a first-order stochastic gradient descent algorithm. The algorithm builds the search direction for the model updates from exponentially weighted first- and second-order moment estimates of the gradients over successive iterations. We outlined an empirical strategy to select the hyperparameters of the Adam algorithm, particularly the learning rate, and the optimal mini-batch size for random shot selection. The algorithm was tested on synthetic data from the Marmousi model. We found that full waveform inversion with random shot selection yields superior results and converges faster than conventional FWI implemented with the l-BFGS method.