1 Introduction

Full waveform inversion (FWI) is a computationally intensive technique partly due to its requirement of repeated wave simulations (Virieux and Operto 2009). Source encoding and random shot selection are two commonly employed strategies to decrease the computational cost of FWI. Source encoding can be implemented either during acquisition (Neelamani and Krohn 2008), by simultaneously injecting multiple sources with varying source signatures, or during migration (Romero et al. 2000; Perrone and Sava 2009; Tu and Herrmann 2012) and FWI (Krebs et al. 2009; Van Leeuwen et al. 2011; Matharu and Sacchi 2018; Rao and Wang 2017), by convolving sequentially fired sources with randomly generated encoding sequences. The encoded sources act as a single effective spatially extended source, thereby reducing the number of wavefield simulations required.

Random shot selection involves approximating the misfit and the gradient by a randomly chosen subset of shots at each iteration (Diaz and Guitton 2011; Van Leeuwen and Herrmann 2013; Ouellet et al. 2017; Yang et al. 2018). The subset of selected shots constitutes a ‘mini-batch’ (Yang et al. 2018), borrowing from machine learning terminology. As prescribed by the theory of stochastic optimization (Robbins and Monro 1951), the gradients are averaged over a certain iteration history to build the search direction. Although stochastic optimization algorithms exhibit a rapid reduction in the misfit function in the initial iterations, gradient averaging leads to slower convergence in the later iterations. Yang et al. (2018) hasten the convergence by preconditioning the gradient with structure-oriented filters (Hale 2011) computed using migrated images. Van Leeuwen and Herrmann (2013) prescribed a gradual increase in the mini-batch size to benefit from the fast convergence associated with stochastic and conventional optimization algorithms in the initial and later iterations, respectively. They calculate the model updates using an l-BFGS-based (Nocedal and Wright 1999) optimization scheme introduced by Friedlander and Schmidt (2012). However, the l-BFGS method needs to be suitably modified for application in the stochastic setting. The proposed modifications involve uncoupling the curvature and gradient estimates by updating the inverse Hessian on a different schedule than the gradients (Byrd et al. 2014) or by updating the inverse Hessian using gradients from the same set of sources in successive iterations (Ouellet et al. 2017). Matharu and Sacchi (2019) introduced a stochastic truncated Newton algorithm, wherein the Hessian-vector product is formed using a non-uniformly sampled set of sources and the second-order adjoint-state method (Fichtner and Trampert 2011; Metivier et al. 2012).

The deep learning era has spawned the development of a variety of algorithmic extensions to stochastic gradient descent methods (Goodfellow et al. 2016). Richardson (2018) and Sun et al. (2020) showed that acoustic wave propagation can be simulated using a recurrent neural network (RNN), leading to the formulation of FWI as the training of the RNN. They compared several optimization algorithms for training the RNN and found that Adam (Kingma and Ba 2014), a stochastic gradient based algorithm with adaptive moment estimation, converges rapidly and yields stable results. Here, we apply the Adam optimization algorithm to FWI with random shot selection, albeit without the need to simulate wave propagation as an RNN. We first review the methodology of stochastic FWI with mini-batches, and then discuss certain empirical criteria for the selection of hyperparameters with the aid of numerical examples. We then illustrate the performance of the algorithm on synthetic data generated from the Marmousi model. Finally, we compare the performance of the algorithm with conventional FWI, illustrate the dependence of FWI on the initial model and the frequency content of the data, and discuss the potential implementation of the presented algorithm to field data. Codes to reproduce the examples presented in this manuscript will be made publicly available.

2 Stochastic full waveform inversion with mini-batches

The data-misfit function for full waveform inversion with random shot selection can be defined as:

$$\begin{aligned} {\mathcal{J}}_{B}({\theta }) = \frac{1}{2B} \sum _{i \in \Omega _{B}} \sum _{j=1}^{N_{r}} \int _{0}^{T} \left\| d^{\mathrm {obs}}\left( x^{i}_{s}, x^{j}_{r}, t\right) - d^{\mathrm {sim}}\left( x^{i}_{s}, x^{j}_{r}, t, {\theta }\right) \right\| ^{2}_{2} \, {\mathrm {d}}t, \end{aligned}$$
(1)

where \({\theta }\) represents a suitably discretized subsurface model, \(d^{\mathrm {obs}}\) and \(d^{\mathrm {sim}}\) denote the observed and simulated data in the time domain, and \(T\) denotes the total recording time. The coordinate of the ith source in a randomly chosen subset of sources \(\Omega _{B}\) is denoted by \(x^{i}_{s}\). The number of sources in the subset is denoted by B, and it may be chosen to be less than or equal to the total number of sources, denoted by \(N_{s}\). The coordinates of the jth receiver are denoted by \(x^{j}_{r}\), and a fixed receiver spread of \(N_{r}\) receivers is assumed for all the shots. A fixed receiver spread is not a necessary condition; it is used here for convenience. Additionally, we assume a constant-density acoustic wave equation to model and invert seismic data. Henceforth, the model parameter \({\theta }\) will refer to the P-wave velocity model.
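As an illustration, a minimal numpy sketch of the mini-batch misfit in equation (1) is given below, assuming the observed and simulated data are held in memory as dense arrays indexed by source, receiver, and time sample; the array names and the approximation of the time integral by a simple sum are illustrative choices rather than part of the actual implementation.

```python
import numpy as np

def minibatch_misfit(d_obs, d_sim, batch, dt):
    """Mini-batch data misfit of equation (1).

    d_obs, d_sim : arrays of shape (n_sources, n_receivers, n_time)
    batch        : indices of the B randomly selected shots (Omega_B)
    dt           : time sampling interval, used to approximate the integral
    """
    B = len(batch)
    residual = d_obs[batch] - d_sim[batch]
    return 0.5 / B * np.sum(residual ** 2) * dt
```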

Full waveform inversion with mini-batches proceeds by successively updating the model along a search direction:

$$\begin{aligned} {\theta }_{k+1} = {\theta }_{k} + \alpha _{k} \, {\mathbf{s}}_{k}, \end{aligned}$$
(2)

where \(\alpha _{k}\) is a suitable step-length and \({\mathbf{s}}_{k}\) is the search direction. The subscripts in equation (2) denote the iteration number. In the stochastic gradient descent scheme, the search direction is chosen as the negative of the gradient, i.e., \({\mathbf{s}}_{k} = - \partial {\mathcal{J}}_{B} / \partial {\theta }_{k}\). The subset of shots chosen to form the misfit function \({\mathcal{J}}_{B}\) and the gradient varies with each iteration. Hence, the gradient at each iteration is a noisy estimate of the true gradient and needs to be smoothed across iterations (Van Leeuwen and Herrmann 2013). The gradient may also be suitably preconditioned, for instance, by structure-oriented filters (Hale 2011) computed using migrated images (Yang et al. 2018). Alternatively, second-order optimization methods suitably modified for the stochastic optimization setting can also be used to scale the gradient by an estimate of the inverse Hessian (Van Leeuwen and Herrmann 2013; Ouellet et al. 2017).

Here, we use the Adam (Kingma and Ba 2014) optimization algorithm to compute the updates to the velocity model. The algorithm builds the search direction from first- and second-order moments of the gradient vectors, weighted exponentially over the history of iterations, which stabilize the direction of the update and correct for the curvature of the loss function, respectively. For the sake of clarity, the pseudo-code for the Adam optimization algorithm taken from Kingma and Ba (2014) is reproduced with minor modifications in the Appendix. Acoustic wave propagation is simulated with the finite-difference method with absorbing boundaries (including the top) using ‘Devito’, a domain-specific language (DSL) and code generation framework (Luporini et al. 2018; Louboutin et al. 2019). The gradients are computed using the adjoint-state method (Plessix 2006) implemented within the Devito package, and the updates to the velocity model according to the Adam optimization algorithm (Kingma and Ba 2014) are calculated using the Tensorflow package (Abadi et al. 2015).
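For reference, a minimal numpy sketch of a single Adam update applied to the velocity model is given below, following the pseudo-code of Kingma and Ba (2014); the function and variable names are illustrative, the default hyperparameter values correspond to the choices discussed in section 3, and the actual experiments use the Tensorflow implementation.

```python
import numpy as np

def adam_step(theta, grad, m, v, k, alpha=0.01, beta1=0.9, beta2=0.9, eps=1e-8):
    """One Adam iteration (Kingma and Ba 2014) on the velocity model theta.

    grad : mini-batch gradient dJ_B/dtheta from the adjoint-state method
    m, v : running first- and second-order moment estimates of the gradient
    k    : iteration counter, starting at 1
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** k)              # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** k)              # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```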

3 Numerical examples

We illustrate stochastic optimization strategies on a constant-density acoustic model modified from the Marmousi model (Versteeg 1994), displayed in figure 1(a). The data are modelled with 32 evenly spaced sources, each with a Ricker wavelet source signature with a peak frequency of 10 Hz (figure 1b). The receivers, placed 30 m below the surface at an interval of 15 m in the x direction, span the lateral extent of the model. The receiver spread is fixed for all the sources, although this is not a necessary condition for FWI with random shot selection. Figure 1(d) shows a shot record modelled from a source placed at the center of the true velocity model. Independent realizations of band-limited (5–50 Hz) random noise are added to the modelled data to simulate noisy data (figure 1e). The initial model used in the inversion, displayed in figure 2(a), is obtained by smoothing the true model with a 50-point isotropic Gaussian filter. The shot record for the source at the center of the initial model and the corresponding data residuals are shown in figure 2(b and c), respectively.
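For completeness, a short sketch of the 10 Hz Ricker source signature used in figure 1(b) is given below; the time step and the wavelet length are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

def ricker(f_peak, dt, t_shift=None):
    """Ricker wavelet with peak frequency f_peak (Hz), sampled at dt (s)."""
    t_shift = 1.5 / f_peak if t_shift is None else t_shift
    t = np.arange(int(2 * t_shift / dt) + 1) * dt - t_shift
    arg = (np.pi * f_peak * t) ** 2
    return (1.0 - 2.0 * arg) * np.exp(-arg)

wavelet = ricker(f_peak=10.0, dt=0.002)   # 10 Hz wavelet, 2 ms sampling (assumed)
```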

Figure 1

Synthetic model used to generate observed data: (a) True P-wave velocity model modified from the Marmousi model. The source placed at the center of the model is denoted by a red star and the receivers are denoted by a black line. Ricker wavelet centred at 10 Hz plotted in the (b) time and (c) frequency domain. (d) Modelled data from a shot placed at the center of the true model with the source signature shown in (b). (e) Noisy shot record simulated by adding band-limited noise to (d) such that the signal-to-noise ratio is 5.0.

Figure 2

(a) Initial P-wave velocity model used in the inversion, obtained by applying a 50-point isotropic Gaussian filter to the true model. The source placed at the center of the model is denoted by a red star and the receivers are denoted by a black line. (b) Modelled data from the shot placed at the center of the initial model with the source signature shown in figure 1(b). (c) Initial data residuals obtained by subtracting the data shown in (b) from the data displayed in figure 1(d).

3.1 Selection of hyperparameters

The hyperparameters in the Adam optimization algorithm, namely the learning rate \(\alpha \), the mini-batch size, \(\beta _{1}\), and \(\beta _{2}\), need to be set to values suitable for FWI. The symbols for the various hyperparameters are consistent with the notation used in Kingma and Ba (2014) for easy reference. The parameters \(\beta _{1}\) and \(\beta _{2}\) control the exponential weights in the smoothed estimates of the first- and second-order moments of the gradient vector computed over past iterations. The smoothing weights decrease by approximately a factor of e over \(1/(1-\beta )\) iterations. Thus, the suggested values \(\beta _{1}=0.9\) and \(\beta _{2}=0.999\) (Kingma and Ba 2014) lead to effective smoothing windows of ~10 and 1000 iterations, respectively. A larger window is required for the second-order moment estimate in sparse settings (Kingma and Ba 2014). Most FWI applications do not involve a large number of iterations; furthermore, the FWI gradient is typically not sparse. Therefore, we set \(\beta _{1} = \beta _{2} = 0.9\), effectively averaging over 10 iterations to estimate the first- and second-order gradient moments. We found that the results were fairly robust for values of \(\beta _{1}\) and \(\beta _{2}\) between 0.8 and 0.95. The parameter \(\epsilon \) stabilizes the division in the model update (Kingma and Ba 2014) and is left at its default value of \(10^{-8}\).

3.2 Learning rate

The performance of the Adam optimization algorithm is sensitive to the value of the learning rate \(\alpha \). The learning rate is usually found by performing a grid search with cross-validation over a range of possible values (Goodfellow et al. 2016; Richardson 2018). We choose to optimize with a constant learning rate, although learning rate decay may be used in stochastic gradient descent methods to improve convergence (Goodfellow et al. 2016). Sun et al. (2020) adapt the value of the learning rate by requiring the velocity model updates to be of the order of 10 m/s, since the magnitude of the difference between the initial and true velocity models may be of the order of 0–1000 m/s. Hence, they search for the optimal learning rate in the interval [10, 100].

Here, we search for the optimal learning rate in the interval [\(10^{-4}\), 1.0], since we use km/s as the unit of the velocity models. We follow the strategy described in Smith (2017) to choose the learning rate for algorithms based on stochastic gradient descent. Although empirical, the criterion presented by Smith (2017) has proven effective in finding optimal learning rates in a variety of deep learning applications (Howard and Gugger 2020). The criterion involves training the model for a few iterations with an exponentially increasing learning rate and choosing a value smaller than the one at which the data misfit attains its minimum, within the region where the misfit decreases rapidly. Figure 3 displays the data-misfit value (equation 1) for varying mini-batch sizes as a function of the exponentially increasing learning rate over 20 iterations of the optimization algorithm. The data misfit attains a minimum around the learning rate \(\alpha =0.1\); the chosen learning rate should therefore lie in the interval [0.001, 0.1), where the data misfit decreases rapidly. In order to illustrate the effect of the learning rate and the criterion discussed above on the choice of the optimal learning rate, we perform full waveform inversion with the Adam optimization algorithm and varying mini-batch sizes. Figure 4 plots the data misfit (equation 1) as a function of ‘epochs’ (an epoch is defined as a single pass over all the sources) for mini-batch sizes of 4, 8, 16, and 32 and varying learning rates. The inversion diverges for the learning rate \(\alpha =0.1\) for all the mini-batch sizes, as the data misfit grows with epochs due to large model updates (equation 2). Although a learning rate of \(\alpha =0.001\) leads to a monotonic decrease in the misfit function, the misfit function decreases most rapidly for \(\alpha =0.01\). Indeed, \(\alpha =0.01\) roughly corresponds to the inflection point in figure 3, supporting the empirical criterion suggested by Howard and Gugger (2020). Hence, we choose a learning rate of 0.01 for all the mini-batch sizes considered in the experiments below.
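A minimal sketch of this learning-rate range test is given below; `misfit_and_grad`, `sample_batch`, and `update_step` are hypothetical placeholders for the FWI machinery (the mini-batch misfit, the adjoint-state gradient, and the Adam step), and the number of iterations and the search interval follow the values used here.

```python
import numpy as np

def lr_range_test(theta0, misfit_and_grad, sample_batch, update_step,
                  n_iters=20, lr_min=1e-4, lr_max=1.0):
    """Learning-rate range test in the spirit of Smith (2017).

    The learning rate grows exponentially from lr_min to lr_max over n_iters
    iterations while the mini-batch misfit is recorded; a value below the
    misfit minimum, where the misfit still drops rapidly, is then selected.
    """
    lrs = np.logspace(np.log10(lr_min), np.log10(lr_max), n_iters)
    theta, state, history = theta0.copy(), None, []
    for lr in lrs:
        batch = sample_batch()                       # random subset of shots
        misfit, grad = misfit_and_grad(theta, batch)
        history.append((lr, misfit))
        theta, state = update_step(theta, grad, state, lr)
    return history   # plot misfit vs. learning rate to pick alpha
```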

Figure 3

Data-misfit values for varying mini-batch sizes as a function of exponentially increasing learning rate plotted in the log–log scale. The optimal learning rate should be less than the value at the minimum (\(\alpha \approx 0.1\)) and can be chosen from among the values where the loss function decreases rapidly.

Figure 4

Data-misfit values for varying learning rates plotted in the log–log scale for mini-batch size of (a) 4, (b) 8, (c) 16, and (d) 32.

3.3 Mini-batch size

Friedlander and Schmidt (2012) and Van Leeuwen and Herrmann (2013) considered the error in the expected value of the gradient computed from mini-batches relative to that computed from the full dataset. They found that the error in the gradient is larger for random sampling with replacement (i.e., the full set of shots is available for populating each mini-batch) than without replacement (i.e., each mini-batch is populated by a unique set of shots). Here, we follow the common practice in machine learning (Goodfellow et al. 2016) and form mini-batches by uniform random sampling without replacement within an epoch. All the shots are made available for sampling again after each epoch (a full pass over all 32 shots), as sketched below.
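The following short sketch illustrates this sampling scheme, assuming 32 shots as in the experiments; shot indices are drawn uniformly without replacement within an epoch and the pool is reshuffled at the start of each new epoch.

```python
import numpy as np

def epoch_minibatches(n_sources=32, batch_size=4, rng=None):
    """Yield mini-batches of shot indices sampled without replacement
    within an epoch; every shot is used exactly once per epoch."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(n_sources)
    for start in range(0, n_sources, batch_size):
        yield order[start:start + batch_size]

# Example: two epochs with a mini-batch size of 4
for epoch in range(2):
    for batch in epoch_minibatches(n_sources=32, batch_size=4):
        pass  # compute the mini-batch gradient and apply the Adam update here
```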

Figure 5 plots the data misfit in logarithmic scale as a function of the number of epochs. The data misfit for the experiments with mini-batch sizes of 4 and 8 decreases rapidly compared to that with mini-batch sizes of 16 and 32. The slight increase in the misfit function in later epochs for mini-batch sizes of 4 and 8 can be attributed to a reduction in the extent of noise fit by the inversion. Figure 6 displays the inverted velocity model as a function of epochs for the experiments with varying mini-batch sizes. As suggested by the cross-over of the dashed red and solid black lines in the plot of the misfit function (figure 5), the velocity model from FWI with a mini-batch size of 4 converges quickly towards the true velocity model in the initial epochs, while the inverted velocity model from a mini-batch size of 8 gets closer to the true velocity model in later epochs.

Figure 5

Data-misfit values corresponding to the learning rate \(\alpha =0.01\) plotted in the log–log scale as a function of epochs.

Figure 6

Velocity models retrieved from the Adam algorithm as a function of varying mini-batch size (M.B.) and epochs. The learning rate is fixed at \(\alpha =0.01\).

Figure 7(a) displays the inverted velocity model retrieved after 64 epochs for the algorithm implemented with a mini-batch size of 4. The difference between the true and inverted velocity models is shown in figure 7(b). The inverted velocity model exhibits high resolution and is close to the true model. The largest deviations are observed close to the edges of the model due to low data fold. The shot record and data residuals from the center of the inverted model are displayed in figure 7(c and d). Compared to the initial residuals (figure 2c), an order of magnitude reduction can be observed for the residuals corresponding to the inverted model. Figure 8 displays the inverted velocity model, the deviation from the true model, and the predicted data and residuals from the center of the model retrieved after 64 epochs for the algorithm implemented with a mini-batch size of 8. The result from the inversion with a mini-batch size of 4 is closer to the true velocity model (figure 7b) than that from the inversion with a mini-batch size of 8 (figure 8b). However, since the computation of the simulated data and the gradient for each shot is an ‘embarrassingly parallel’ problem, the computation time required for running 64 epochs with a mini-batch size of 8 is half that of the experiment with a mini-batch size of 4, provided the shots within a mini-batch are simulated in parallel. The nominal computational cost associated with both mini-batch sizes is the same, since both experiments involve the same number of forward and adjoint simulations. Inversion with larger mini-batch sizes requires more iterations to produce velocity models close to the true model, although the computation time per iteration may be low provided a sufficient number of parallel threads is available. However, an increase in the number of iterations leads to an increase in the computational cost due to the increase in the number of forward and adjoint simulations. Indeed, McCandlish et al. (2018) argue that the trade-off between computational time and cost in training deep neural networks to achieve a specified level of performance takes the form of a Pareto frontier.

Figure 7

Inversion results after 64 epochs and a mini-batch size of 4: (a) inverted velocity model, (b) difference between the true (figure 1) and inverted model, (c) shot record from the center of the inverted model, and (d) the corresponding data residuals. The learning rate is fixed at \(\alpha =0.01\).

Figure 8

Inversion results after 64 epochs and a mini-batch size of 8: (a) inverted velocity model, (b) difference between the true (figure 1) and inverted model, (c) shot record from the center of the inverted model, and (d) the corresponding data residuals. The learning rate is fixed at \(\alpha =0.01\).

4 Discussion

The numerical experiments outlined in the previous section illustrate the efficacy of the Adam optimization algorithm in FWI with random shot selection. In this section, we compare the results with those from conventional FWI and illustrate the ‘cycle skipping’ problem through the dependence of the algorithm on the initial model and the frequency content of the data. Finally, we discuss the implementation of the algorithm on field data.

Figure 9

Inversion results with conventional FWI implemented with the l-BFGS algorithm. The initial model is the same as that in figure 2(a) and 32 shots modelled without added noise are inverted. (a) Data misfit as a function of misfit function evaluations. (b) Inverted velocity model obtained after 64 function evaluations. Data residuals for a shot placed at the center of the model corresponding to (c) the initial model, and (d) the inverted model from (b).

4.1 Comparison with conventional FWI

We now illustrate the performance of conventional FWI performed using the l-BFGS method with a full batch of 32 shots without added band-limited noise. The initial model for the inversion is chosen to be the same as that for the Adam algorithm. The l-BFGS algorithm involves a line search to perform the model updates (Nocedal and Wright 1999), and multiple forward and adjoint simulations may be necessary per l-BFGS iteration. In order to compare the results from the Adam and l-BFGS methods at the same computational cost (i.e., number of forward and adjoint simulations), we terminate the l-BFGS algorithm after 64 misfit function evaluations. Figure 9 displays the misfit function, the inverted velocity model, and the initial and final data residuals corresponding to a source placed at the center of the model. It can be observed that the result from the Adam optimization algorithm with random shot selection is far superior to that obtained by conventional FWI. Second-order methods, including the Gauss–Newton and l-BFGS methods, though widely used in FWI (Virieux and Operto 2009), are not regularly employed in training deep neural networks, as they have been found to terminate near saddle points (Goodfellow et al. 2016). However, accounting for the Hessian matrix is crucial in FWI to resolve trade-offs between different parameters encountered in, for instance, elastic FWI (Virieux and Operto 2009). We plan to test the suitability of first-order stochastic gradient-descent algorithms like Adam in multi-parameter FWI problems in future work.
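A hedged sketch of such a baseline using the l-BFGS implementation in SciPy is given below; the quadratic stand-in for the misfit and the model dimensions are placeholders for the actual Devito-based misfit and adjoint-state gradient, and the `maxfun` option caps the number of misfit evaluations at 64 to match the cost of the stochastic experiments.

```python
import numpy as np
from scipy.optimize import minimize

theta_true = np.full(101 * 301, 3.0)      # placeholder velocity model (km/s)
theta_init = np.full(101 * 301, 2.5)      # placeholder initial model (km/s)

def misfit_and_grad(theta):
    """Stand-in for the full-batch misfit (equation 1 over all 32 shots) and
    its adjoint-state gradient; replace with the Devito-based computation."""
    residual = theta - theta_true
    return 0.5 * residual @ residual, residual

# l-BFGS baseline capped at 64 misfit evaluations so that the number of
# forward/adjoint simulations matches 64 epochs of the stochastic scheme.
result = minimize(misfit_and_grad, theta_init, jac=True,
                  method="L-BFGS-B", options={"maxfun": 64})
print(result.nfev, result.fun)
```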

Figure 10

Source signature and data used in the high-frequency experiment: Ricker wavelet centred at 25 Hz plotted in the (a) time and (b) frequency domain. (c) Modelled data from a shot placed at the center of the true model (figure 1a) with the source signature shown in (a).

4.2 Dependence on the initial model and frequency content of the data: Cycle skipping

It is well known that the performance of FWI implemented with the \(l^{2}\) norm data misfit (e.g., equation 1) depends on the choice of the initial model and the bandwidth of the data utilized in the inversion (Bunks et al. 1995; Virieux and Operto 2009; Metivier et al. 2016). The initial model should be chosen such that the predicted data are within half a period of the observed data. FWI with an initial model that violates this requirement converges to a local minimum, predicting incorrect velocity models and an unsatisfactory match with the observed data, a phenomenon termed cycle skipping (Virieux and Operto 2009). To address the issue of cycle skipping, Bunks et al. (1995) proposed a hierarchical strategy that inverts data in narrow frequency bands, starting with lower frequencies and proceeding to higher frequencies. This strategy can be implemented either in the time domain (Bunks et al. 1995) or in the frequency domain (Sirgue and Pratt 2004; Operto et al. 2004).

Since the data misfit (equation 1) is based on the \(l^{2}\) distance between the observed and predicted data, the algorithm presented here also suffers from the cycle skipping problem if the chosen initial model does not predict data within half a cycle of the observed data. In order to illustrate this phenomenon, we simulate noiseless data with a Ricker wavelet centred at 25 Hz (figure 10). As with the earlier experiments, the initial model is chosen to be the one shown in figure 2(a). The velocity model obtained after 64 epochs of the Adam optimization algorithm implemented with a mini-batch size of 4 and a learning rate of 0.01 is shown in figure 11. The initial and final data residuals from the inversion are displayed in figure 12. The inversion result is unsatisfactory due to cycle skipping: as evident from figure 12(c and d), the initial model predicts data more than half a cycle away from the true data. Consequently, the data predicted at the final iteration do not match the phase of the true data. Figure 13(a) displays a velocity model obtained by applying a 20-point isotropic Gaussian smoothing filter to the true velocity model (figure 1a). The corresponding inverted velocity model (figure 13b) is closer to the true velocity model. The initial and final residuals are shown in figure 14; the predicted initial data are within half a cycle of the true data, and the data predicted at the final iteration display a better match with the true data than those in figure 12. However, the velocity model obtained from the inversion of the higher-frequency data (generated by a Ricker wavelet centred at 25 Hz) does not compare well with that obtained from the inversion of the narrowband, lower-frequency data (generated by a Ricker wavelet centred at 10 Hz). Although the initial velocity model does predict the first arrival to within half a cycle of the observed data, later reflection events are not within half a cycle, leading the inversion to converge to a local minimum. This can be mitigated by following a hierarchical inversion strategy wherein narrow bands of data are inverted progressively from lower to higher frequencies. Filtering the data to the bandwidth of 1–20 Hz produced results identical to those presented in the last section (figure 7).
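A minimal sketch of the band-pass filtering step used in such a hierarchical strategy is given below, assuming the shot gathers are held as a dense array with time along the last axis; the filter order and the SciPy-based zero-phase implementation are illustrative choices rather than the exact processing used here.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_shots(data, f_min, f_max, dt, order=4):
    """Zero-phase band-pass filter along the time axis of the shot gathers
    (shape: n_sources x n_receivers x n_time), e.g. 1-20 Hz for the first
    stage of a low-to-high frequency hierarchical inversion."""
    sos = butter(order, [f_min, f_max], btype="bandpass",
                 fs=1.0 / dt, output="sos")
    return sosfiltfilt(sos, data, axis=-1)
```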

Figure 11

Velocity model obtained after inversion with the noiseless data shown in figure 10(c). The initial model used for the inversion is displayed in figure 2(a). The algorithm was terminated after 64 epochs for a mini-batch size of 4 and the learning rate \(\alpha =0.01\).

Figure 12

Data residuals corresponding to (a) the initial model (figure 2a) and (b) the inverted model (figure 11). Traces windowed around the first arrival corresponding to the modelled data from the true (solid-black), initial (gray-dotted), and inverted (red-dashed) velocity models are plotted for (c) \(x=0\) km, and (d) \(x=7.5\) km.

Low-frequency data with a good signal-to-noise ratio are crucial for the success of FWI. Ovcharenko et al. (2019) demonstrated the ability of a deep convolutional neural network to predict data at 0.25 Hz from data between 2 and 4.5 Hz modelled from synthetic benchmark models. Similarly, Sun and Demanet (2020) extrapolated data in the 0.1–5 Hz band from synthetic data in the 5–20 Hz band modelled from the Marmousi model. Dellinger et al. (2016) and Brenders et al. (2020) presented an ultra-low-frequency marine source that can yield robust data in a bandwidth of 1.75–2.5 Hz at long offsets. Alternatively, the cycle skipping problem can be mitigated by defining the misfit based on an optimal transport distance (Metivier et al. 2016, 2018; Yang and Engquist 2019) to convexify the non-convex misfit function based on the \(l^2\) norm. In summary, modifications to the misfit function, the inclusion of low-frequency data, or a hierarchical inversion strategy can be suitably combined with the algorithm presented here to produce accurate subsurface models from field data.

Figure 13

(a) Initial velocity model obtained by applying a 20-point isotropic Gaussian smoothing filter to the true velocity model. (b) Velocity model obtained after inversion with the noiseless data shown in figure 10(c). The algorithm was terminated after 64 epochs for a mini-batch size of 4 and the learning rate \(\alpha =0.01\).

Figure 14

Data residuals corresponding to (a) the initial model (figure 13a) and (b) the inverted model (figure 13b). Traces windowed around the first arrival corresponding to the modelled data from the true (solid-black), initial (gray-dotted), and inverted (red-dashed) velocity models are plotted for (c) \(x=0\) km, and (d) \(x=7.5\) km.

4.3 Towards implementation on field data

While we have presented the performance of the algorithm on the benchmark Marmousi model, future work will focus on implementing the algorithm on a suitable field dataset. Here, we outline a processing flow towards a potential implementation of the algorithm on field data. The source wavelet was assumed to be known exactly in the presented methodology. If the source wavelet is not available or not accurately recorded for field data, it can be estimated following the methodologies presented by Pratt (1999) and Plessix and Cao (2011). The initial velocity model can be estimated by first-arrival traveltime tomography (Jaiswal et al. 2008, 2009; Prieux et al. 2009), by Laplace–Fourier domain inversion (Shin and Cha 2009), or with the application of global optimization methods (Datta and Sen 2016). It should be ensured that the initial velocity model predicts data within half a cycle of the observed data. A hierarchical inversion strategy can then be followed, applying the Adam optimization algorithm with random shot selection as presented above.

5 Conclusions

We presented a strategy for full waveform inversion with random shot selection based on Adam, a first-order stochastic gradient descent algorithm. The algorithm builds the search direction for the model updates from exponentially weighted first- and second-order moment estimates of the gradients over successive iterations. We outlined an empirical strategy to select the hyperparameters of the Adam algorithm, particularly the learning rate, and the optimal mini-batch size for random shot selection. The algorithm was tested on synthetic data from the Marmousi model. We found that full waveform inversion with random shot selection yields superior results and converges faster than conventional FWI implemented with the l-BFGS method.