Introduction

Recent technological advances have made gaze tracking practical in a wide variety of settings, both inside and outside the lab. One particularly promising application for gaze tracking is exploring how gaze patterns might reveal on-line processing while participants view dynamic stimuli such as videos. For example, researchers have examined how gaze patterns might reveal differences in the knowledge or point of view people use to organize their understanding of videos (Huff et al., 2017; Loschky, Larson, Magliano, & Smith, 2015). Differences in gaze patterns between successful and unsuccessful problem solvers looking at diagrams (Grant & Spivey, 2003) suggest that gaze patterns may also reveal differences in learning from videos. One approach to analyzing gaze patterns is to relate them to the visual contents of videos, but it can be difficult to reliably extract valid scene-based regions of interest, calibration issues in real-world settings may preclude effective gaze-object linkages, and it may be impractical to recode each new video one wishes to test. An alternative approach is therefore to directly assess the degree to which interesting cognitive similarities between individuals might be reflected in similar gaze patterns across the common time-series of events depicted in the video.

There are many published methods for assessing the similarity between spatio-temporal trajectories. Some have been developed for trajectory data mining purposes and are very general in terms of the type of data they can handle (for comprehensive reviews of trajectory data mining techniques see Zheng, 2015; Atluri et al., 2018; and Bian et al., 2018). Other methods have been developed specifically to compare scanpaths of different viewers, taking into consideration patterns specific to eye fixations and saccadic movements (Anderson, Anderson, Kingstone, & Bischof, 2015; Dorr, Martinetz, Gegenfurtner, & Barth, 2010). Some of these approaches consider similarity between multiple features such as position, duration, sequence, amplitude, and direction (Anderson et al., 2015). However, not all methods are suitable for the analysis of gaze with dynamic stimuli. According to Anderson et al. (2015), the scanpath comparison method should be chosen based on the hypothesis being tested, and for the case of a video, sensitive measures of similarity may be needed. Many approaches, such as Scan Path Entropy (Hooge & Camps, 2013), hidden Markov models (Coutrot, Hsiao, & Chan, 2018), and the popular ScanMatch algorithm (Cristino, Mathôt, Theeuwes, & Gilchrist, 2010), rely on designated regions of interest (ROIs), which may be difficult to establish, especially for experiments involving videos. Additionally, many methods rely on some form of clustering and employ hard cutoffs in the process, which could exhibit instabilities under small perturbations (Dorr et al., 2010), an undesirable characteristic.

Currently, a popular approach to linking similarities in gaze patterns with similarities in cognitive variables is to compute similarity metrics based on the well-known Normalized Scanpath Saliency (NSS). In general, NSS assesses the degree of correspondence between gaze locations on a plane and a scalar field. In many works the scalar field represents variations in perceptual saliency across locations in a visual scene, and NSS has frequently been used to validate computational models of perceptual saliency by comparison to gaze positions (Peters, Iyer, Itti, & Koch, 2005; Riche et al., 2013; Sikha, Kumar, & Soman, 2018).

More recently, the NSS has been adapted by Dorr et al. (2010) and subsequent works (e.g., Loschky et al., 2015; Huff et al., 2017) for the purpose of calculating the similarity of gaze scanpaths over time, within and between groups of participants. The goal is to test the degree to which subgroups of participants who might be expected to think differently about a video actually tend to look at different things. If this occurs, then one would expect the mean similarity of between-participant gaze patterns within groups to be higher than the similarity of gaze patterns between groups (Dorr et al., 2010; Loschky et al., 2015; Lukavský, 2013; Huff et al., 2017; Borji, Sihite, & Itti, 2013). For example, in one study by Huff et al. (2017), fans of two different soccer teams viewed a video of a match between the teams, and NSS was used to test whether fandom affected gaze patterns by assessing the degree to which each fan’s gaze pattern was more similar to other fans of the same team than to fans of the other team. Since the NSS similarity does not require designating regions of interest, it is a suitable candidate method for these types of studies (see Dorr et al., 2010, for a more in-depth discussion of the advantages of the NSS).

Several variations of NSS-based similarity measures have been used in previous works, but their calculation procedures are closely related. In general, these measures are obtained by first slicing the video into time windows (usually organized around each frame of video), and then obtaining gaze maps and similarity scores for each of those windows (Dorr et al., 2010). Collective gaze maps are constructed from the gaze of groups of participants by superposing Gaussian functions centered at the points of gaze locations (Wooding, 2002; Lukavský, 2013; Dorr et al., 2010). In the example above by Huff et al. (2017), one gaze map was created representing the scanning patterns of fans of soccer team A and another represented fans of team B. The NSS scores were calculated using a leave-one-out procedure in which Gaussians centered on a given participant’s gaze point for each video frame were sampled from a gaze map representing the other members of the same fan group, and from the map representing fans of the other team. If fandom drives gaze, then we would expect the same-group convolutions to yield higher participant-to-group similarity than the between-group convolutions. We note that our reference to “gaze maps” as opposed to “fixation maps” is intentional, because these algorithms usually rely on raw gaze positions rather than fixations, in order to avoid depending on fixation detection algorithms that may be unreliable for video (Dorr et al., 2010).

Previous works have employed variations with Gaussians that are 2D (e.g., Loschky et al., 2015; Marat et al., 2009) as well as 3D (e.g., Dorr et al., 2010; Lukavský, 2013). The 3D version considers time as a third dimension, so similarities include both the gaze position at the present time and a small number of points before and after, with those that are temporally further from the present carrying lower weight. The 2D version ignores the temporal coordinate and treats all gaze points within a time window as if they occurred simultaneously. An important difference is that the 3D calculation takes into account the temporal ordering of gaze points, while the 2D one does not.

One substantial drawback of the methods used in previous works is their reliance upon very computationally expensive calculations for video-like stimuli. The reason is that current calculations of NSS-based metrics employ numerical grids to represent the gaze maps (see Lukavský, 2013), with floating point values for each image pixel, and this is iterated over the many frames in a movie (and iterated again by creating new leave-one-out maps for each participant). The number of pixels in high definition video can be quite large: full resolution 2K video has a grid size of 2,073,600 pixels, and UHD 4K video has a grid size of up to 8,847,360 pixels. Additionally, for some modern potential applications, such as eye tracking in immersive 360-degree VR environments, the needed grid could be too large for the method to be practical.

It may be possible to counteract the computational burden of grid calculations by downsampling the data in space or time (sometimes referred to as binning) (Lukavský, 2013). This can come at the cost of accuracy, so it must be done with caution, and the calculations could still be intensive. Also, some applications, such as screen-captured instructional videos, may require fine discriminations between, for example, words and characters, discriminations that will take advantage of increasing eye tracking accuracy. Thus, a new approach that overcomes these computational burdens while maintaining accuracy is desirable.

There are a number of important reasons to require computational efficiency for NSS calculations. For example, it is often useful to apply Monte Carlo techniques (such as resampling, bootstrapping, or permutation tests) to assess statistical significance. The need for Monte Carlo techniques arises in part because calculations that compare each participant to highly similar leave-one-out maps have uncertain between-participant independence, and in part from the need to ensure robustness to outliers. Monte Carlo methods can be very powerful and have become more popular recently, because they avoid many of the independence and distributional assumptions inherent to traditional statistical methods, but they require executing many resampling iterations (Manly, 2006). A Monte Carlo approach can typically require 1000 random shuffles (for example, of participant group membership), each of which requires recalculating the entire NSS. Their usage with grid-based calculations can therefore easily become computationally prohibitive.

Caldara and Miellet (2011) analyze 3D Gaussian fixation maps (considering temporal data) using iMap, an approach inspired by functional magnetic resonance imaging (fMRI) methods. The statistical maps generated by iMap also encounter difficulties due to the high number of pixels. Such limitations were overcome in their work by using random field theory (RFT).

A later version of the work, iMap4 (Lao, Miellet, Pernet, Sokhn, & Caldara, 2017), introduced a pixel-wise linear mixed model (LMM) applied to Gaussian-smoothed 2D eye fixation maps (no temporal dependence). Monte Carlo statistical tests were employed there as well, and this required some approximations and additional strategies to reduce the large computational burden. Spatial downsampling and masking of pixels were both used by iMap4 to achieve these efficiencies. Lao et al. (2017) also give some guidelines on downsampling.

Eye tracking data can be augmented to very high dimensions with variables beyond the x-y coordinates of the gaze (Burch, 2018; Blascheck, Burch, Raschke, & Weiskopf, 2015). For instance, the gaze data acquired can be stereoscopic, and if vergence is also taken into consideration then a z dimension needs to be added (Mathôt, Cristino, Gilchrist, & Theeuwes, 2012). Some examples of related ocular data commonly used to augment the raw gaze are: fixation durations, saccade lengths and orientations, and pupil dilations (Blascheck et al., 2015; Burch, 2018; Cristino et al., 2010; Mathôt et al., 2012). Blascheck et al. (2015) and Burch (2018) also mention that eye tracking data are being augmented with other sources such as blood pressure, electroencephalography (EEG), functional magnetic resonance imaging (fMRI), galvanic skin response (GSR), mouse and keyboard interactions, motion tracking, and verbal data, among others. Due to this complexity, some methods have been devised to consider multiple dimensions when analyzing the scanpath (see for example Foerster & Schneider, 2013; Cristino et al., 2010; Mathôt et al., 2012; Dewhurst et al., 2012; and Jarodzka et al., 2010). For the calculation of NSS similarity with a grid, high-dimensional gaze data may induce prohibitive computational costs that further limit the application of this technique.

The new calculation method devised in the present work uses an exact mathematical solution, treating space as continuous rather than discretized, to calculate gaze similarity. It also preserves spatial accuracy regardless of the resolution of the video. We will refer to this approach as the analytical NSS. In short, it yields results with equal or better accuracy than the grid-based approach, and it is capable of running orders of magnitude faster. In the limit of infinitely fine resolution, the grid and analytical methods return the same scores. Additionally, the analytical NSS can easily be used for gaze data with very high dimensions, allowing its usage in more applications.

We present two forms of the analytical NSS: one we refer to as exact, which is limited only by computational numerical accuracy, and an approximate form. Both are identical in almost every aspect and differ only in the equations used to calculate them. Whenever the analytical NSS method is mentioned, the reference is to both forms unless otherwise specified.

The goal of the present work is to introduce the new analytical NSS approach to calculate gaze similarities. We first present a detailed overview of typical grid-based calculations used in the literature. We then explain our new algorithm and its mathematical formalism. Afterwards, we discuss the usage of Monte Carlo methods with the analytical framework. In the last section we then compare the speed performance of our Fortran implementations of both grid and analytical algorithms, and demonstrate the analytical method’s capability for Monte Carlo calculations.

The overall algorithm

The overall algorithm to calculate the NSS similarity metrics is the same, regardless of whether the calculations are made with the grid-method or the analytical one. In this section we explain what needs to be calculated rather than how, which is left for the later sections detailing each calculation method.

Multiple variations of the NSS similarity metrics have been employed in various studies, although the differences are small and the core principles from Dorr et al. (2010) remain the same. Rather than trying to cover all variations, we keep the presented formalism in a generalized form so it can be adapted to particular needs. The equations presented here assume a 3D calculation, which is easily reducible to the 2D case. Also, we assume an arbitrary number of groups and participants. We only consider the gaze data from a single eye for each participant (the binocular case is an application with higher dimensionality, to which the method could be extended).

Definitions and notation

The gaze data are split into time windows, which in general could overlap. Each time window W of duration \(T_W\) contains all the gaze data between times \(t_{W_{0}}\) and \(t_{W_{f}}=t_{W_{0}}+T_{W}\). Typically the time windows are selected such that they contain data corresponding to one or more video frames. The NSS score is then calculated for each participant with respect to groups of other participants, for each time window separately. A flowchart of the entire NSS calculation for gaze similarity for a single video is shown in Fig. 1.

Fig. 1 Flowchart for the entire algorithm

Let u denote the participant index, \(u \in \{ 1{\dots } n_{u} \} \), where \(n_u\) is the number of participants, and U the set of all participants. We also define a group of participants g as a subset \(g \subseteq U \).

Groups may be chosen depending on the application, and one typical use would be to account for different experimental conditions. The set of all groups of participants will be denoted as \(\mathcal {G}\).

The participant’s gaze points are denoted as \(\mathbf{r}_{i,u} = (x_i, y_i, t_i)\), where \(x_i\) and \(y_i\) are the horizontal and vertical spatial coordinates (commonly measured in pixels), and \(t_i\) is the time, which in a 3D calculation is considered a third dimension. We define \(P_u\) as the set of gaze points that correspond to participant u.

We will assume a constant sampling rate for simplicity in the present work (although the algorithms can still be applied if that is not the case). We would then have \(n_t\) evenly-sampled time points per window for one participant. However, gaze data losses can occur, so participant u has \(M_u \le n_t\) gaze points within W. Since there are \(n_u\) participants, there is a maximum total of \(n_w = n_u \times n_t\) gaze points within the window.

For each participant a scalar field \(\mathbb {R}^{3} \!\rightarrow \! \mathbb {R}^{1}\) is constructed, consisting of a superposition of 3D Gaussian functions, each centered at one of the participant’s recorded gaze locations in space and time

$$ \varphi_{u} (x,y,t) = \underset{i}{\sum} \phi_{u}(x_{i}, y_{i} , t_{i} ) , $$
(1)

with

$$ \phi_{u}(x_{i}, y_{i} , t_{i} ) = e^{-\nu_{x}(x-x_{i})^{2} -\nu_{y}(y-y_{i} )^{2} -\nu_{t}(t-t_{i})^{2} } . $$
(2)

Here \(\nu = \frac {1}{2\sigma ^{2}}\), with σ being the standard deviation of the Gaussian in each dimension. We may assume \(\nu_x = \nu_y\) (\(\sigma_x = \sigma_y\)) for simplicity, although arbitrary values are possible.
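To make the construction concrete, the following is a minimal Python/NumPy sketch of Eqs. 1 and 2 (the function name and data layout are illustrative, not taken from our Fortran implementation):

```python
import numpy as np

def scalar_field(gaze_points, nu, r):
    """Evaluate a participant's scalar field (Eqs. 1-2) at location r = (x, y, t).

    gaze_points: array of shape (m, 3), one row per recorded gaze point.
    nu: length-3 precisions 1 / (2 sigma^2) for the x, y, and t dimensions.
    """
    diff = np.asarray(gaze_points, dtype=float) - np.asarray(r, dtype=float)
    # Summing the exponents first requires only a single exp call per gaze point.
    return np.exp(-(np.asarray(nu) * diff**2).sum(axis=1)).sum()
```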

The gaze map for observer u with respect to a reference group of other participants g is then given by the sum of their individual scalar fields:

$$ F_{u, g }(x,y,t) =\underset{u^{\prime} \in g \wedge u^{\prime} \neq u }{\sum} \varphi_{u^{\prime}}(x,y,t) , $$
(3)

The raw score η, a scalar, is then obtained by first evaluating the gaze map at the gaze points of participant u that are within the window W, and then averaging those values, as given by

$$ \begin{array}{@{}rcl@{}} \eta_{u,g}= \frac{1}{ M_{u}} \underset{j}{\sum} F_{u, g }(x_{j} ,y_{j},t_{j} ) , \end{array} $$
(4)

where the sum is done over all valid gaze points of the participant. The raw score can be Z-score normalized by subtracting the mean \(\bar {F}_{u, g } \) and dividing by the standard deviation \(Std(F_{u,g})\):

$$ NSS_{u, g } = \frac{ \eta_{u,g} - \bar{F}_{u, g } }{ Std(F_{u, g} ) } $$
(5)

A high positive NSS value implies the participant u is looking at a similar location as the others in the reference group, while a low score indicates that the participant is looking at a location where few others are.

The grid-based calculation

We now describe the outline of a grid calculation (Lukavský, 2013). In this method the scalar field in Eq. 3 is represented using a 3D numerical grid with at least \(n_x \times n_y \times n_t\) points (or \(n_x \times n_y\) for the 2D case). In the present work we assume that \(n_x\) and \(n_y\) are the numbers of pixels on the screen in the x and y directions, respectively (i.e., no spatial binning is used), unless otherwise specified. Additionally, we assume that the time dimension has \(n_t\) points (i.e., one gridpoint per gaze sample), which is the minimum needed to avoid temporal binning, unless otherwise indicated. However, in practice one would probably want more time points for reasons of accuracy, in addition to any padding required during the computation (more details on grid padding can be found in Vetterling & Press, 1992).

An efficient procedure that is typically used to generate a gaze map scalar field is to first create a mask grid with all values initialized to zero, then add 1 to those grid points corresponding to the gaze locations, and afterwards convolve it with another grid containing a single Gaussian function (Tanner & Itti, 2019; Torralba, Castelhano, Oliva, & Henderson, 2006; Lukavský, 2013), which yields as a result a grid with the desired gaze map. Some notes on the implementation of this calculation and its efficiency are given in Appendix B.
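As an illustration of this mask-and-convolve procedure, the following Python sketch builds a 2D gaze map with SciPy (a sketch only, assuming integer pixel coordinates; names are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaze_map_2d(gaze_points, nx, ny, sigma):
    """Mask-and-convolve construction of a 2D gaze map: add 1 at each
    gaze location, then convolve with a Gaussian kernel."""
    mask = np.zeros((ny, nx))
    for x, y in gaze_points:
        mask[int(round(y)), int(round(x))] += 1.0
    # gaussian_filter uses a kernel normalized to unit sum; the constant
    # factor relative to Eq. 2 cancels in the Z-score normalization of Eq. 5.
    return gaussian_filter(mask, sigma=sigma)
```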

Let h be the Gaussian kernel grid and \(B_g\) the binary mask formed by the gaze points of the participants from reference group g, all within a time window. The convolution then yields

$$ \begin{array}{@{}rcl@{}} F_{u, g} &=& B_{g}*h . \end{array} $$
(6)

One can calculate Fu,g by using Eq. 6 and a convolution routine or a Gaussian filter from existing software libraries. It can also be implemented in an efficient manner using the Fast Fourier Transform (FFT) (Vetterling & Press, 1992). Denoting the FFT and its inverse as \(\mathcal {F}\) and \(\mathcal {F}^{-1}\), and making use of the convolution theorem we therefore have that the scalar field Fu,g in Eq. 3 is then given by

$$ F_{u, g}= \mathcal{F}^{-1}\left( \mathcal{F}(B_{g}) \mathcal{F}(h) \right) . $$
(7)

The average value of the grid \(F_{u,g}\) and its standard deviation can be easily calculated, and a Z-score normalization then yields the normalized gaze map. The NSS score is obtained by evaluating this normalized map at the gaze point locations of participant u and averaging, as in Eq. 5.
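Continuing the 2D sketch above, the final grid NSS step could look like this (again assuming integer pixel coordinates):

```python
import numpy as np

def grid_nss(gaze_map, test_points):
    """Z-score the gaze map, then average its values at the test
    participant's gaze locations (Eq. 5)."""
    z = (gaze_map - gaze_map.mean()) / gaze_map.std()
    return float(np.mean([z[int(round(y)), int(round(x))]
                          for x, y in test_points]))
```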

We summarize the calculation of the NSS score with a grid in Algorithm 1 and Fig. 1.

Algorithm 1

Algorithm complexity of the grid NSS

Setting the values in the mask to 1 is a very simple operation that must be done once for each Gaussian function in the window, and therefore has complexity \( \mathcal {O}(n_{t}n_{u} ) \). For the convolution process using FFT the computational complexity is \( \mathcal {O}(N\log (N)) \), where N is the total number of gridpoints (Press, Teukolsky, Vetterling, & Flannery, 2007).

The number of gridpoints N is greatly affected by the dimensionality of the calculation. It depends on the number of pixels on screen and time points within a window for a 3D calculation (or merely the number of pixels for the 2D case). Unless the number of samples per window or of participants is very large, the bottleneck of the entire calculation is likely to be the FFT rather than setting the mask. In that case the computational burden to generate a map for one or for many participants is very similar.

For example, for a 2D calculation with a 1600 × 900 screen (with no spatial downsampling) the number of gridpoints needed is 1.44 million. If we assume a time window with 500 samples and 200 participants, we would need to set 100k gridpoints to 1 in the mask. If we consider instead a 3D calculation with the same parameters, the number of gridpoints needed is 720 million, while there are still 100k gridpoints to set to 1. In both example cases the bottleneck of the calculation is still the FFT.

The task of setting up the mask might seem very repetitive, and it may be possible to optimize it somehow, although no major improvement is expected unless a very large number of participants is processed. Optimization efforts are probably best invested in strategies to cut down the number of gridpoints (in either time or space) or in parallelization.

The analytical calculation

One can make several key observations from the grid algorithm that suggest improvements that serve as the basis for the new method:

  • Many of the operations are sums over all the grid points.

  • All points on the grid need to be computed on all iterations, but many of the points are near zero because Gaussians decay rapidly.

  • The leave-one-out procedure is repetitive, and data corresponding to the same participant get introduced or removed together from the calculation.

The calculation can be optimized by treating space as a continuum and solving some integrals analytically, which avoids the need to sum over every grid point. Pre-calculating values also reduces repetitive storage and retrieval. By aggregating these values by participant, the leave-one-out process becomes even more efficient, since fewer operations are needed per iteration.

The overall analytical method of calculation consists of pre-calculating as many values as possible and storing them into arrays (only once per time window), which are then iterated over each participant and reference groups to calculate the NSS scores. Any NSS score involving any arbitrary combination of participants and groups can be obtained using the pre-calculated array values. A large initial investment in computational effort is required, but this makes subsequent iterations within the same window much faster. The analytical method still maintains the coarse structure described by the flowchart in Fig. 1.

Three key quantities are needed to obtain the NSS score, as seen in Eq. 5: the raw score, the average, and the standard deviation. Each of these can be calculated analytically employing a few equations and arrays S, A, θ and M, which are to be pre-calculated once for every time window.

The matrix element \(S[u,u^{\prime }]\) is the total Gaussian similarity between the pair of participants u and \(u^{\prime }\), and it is used to calculate the raw score. Then, A[u] is the contribution of participant u to the average of the gaze map. Also, \(\theta [u,u^{\prime }]\) is the total overlap between the distributions of participants u and \(u^{\prime }\), and is used to calculate the standard deviation of the gaze map. Lastly, M[u] holds the number of valid gaze points for participant u within W. A summary of the arrays is given in Table 1, more detailed descriptions are presented in the upcoming sections, and the full derivations are given in the Appendix. More details of the analytical method of calculation are given in Algorithm 2.

Table 1 Summary of the arrays that are pre-calculated in each time window, with the quantities they are related to and references to the equations that define them

There is no numerical grid representing the gaze map in this method of calculation, so we no longer use \(n_x\), \(n_y\), and \(n_t\). However, the gaze map is still confined within a 3D box defined by the screen size in space and the time window. The sides of this box have lengths \(L_x\), \(L_y\) (in spatial units or pixels) and \(L_t = T_W\) (time units). The volume of the box is then

$$ \begin{array}{@{}rcl@{}} V=L_{x}L_{y}L_{t}. \end{array} $$
(8)

Raw score

We conveniently define the Gaussian similarity between two gaze points as

$$ \mathcal{S}(\mathbf{r}_{i}, \mathbf{r}_{j} ) =e^{- \nu_{x}(x_{i}-x_{j})^{2} -\nu_{y}(y_{i}-y_{j})^{2} -\nu_{t}(t_{i}-t_{j})^{2} } , $$
(9)

which ranges from 0 to 1, with the latter occurring when the two points are identical.

We can also define the Gaussian similarity for a pair of participants, which is the sum of the similarities of their gaze points. Let \(\mathbf{r}_{i,u}\) and \(\mathbf {r}_{j,u^{\prime }} \) be the ith and jth gaze points of participants u and \(u^{\prime }\). We then have

$$ \begin{array}{@{}rcl@{}} S_{u,u^{\prime} } =\underset{i}{\sum}\underset{j}{\sum}\mathcal{S}(\mathbf{r}_{i,u} ,\mathbf{r}_{j,u^{\prime}} ) , \end{array} $$
(10)

where the sum is done over all valid gaze points of those participants within the window.
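A minimal NumPy sketch of Eqs. 9 and 10, computing the total Gaussian similarity for one pair of participants over all point pairs at once (names and data layout are illustrative):

```python
import numpy as np

def pair_similarity(P_u, P_v, nu):
    """Total Gaussian similarity S[u, u'] (Eqs. 9-10).

    P_u, P_v: arrays of shape (m, d) of valid gaze points within the window.
    nu: length-d array of precisions 1 / (2 sigma^2), one per dimension.
    """
    diff = P_u[:, None, :] - P_v[None, :, :]   # all pairwise differences
    # Exponents are summed across dimensions before a single exp call.
    return np.exp(-(nu * diff**2).sum(axis=-1)).sum()
```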

Using this definition the raw score, Eq. 4, can be rewritten in a form that is more convenient for the analytical method,

$$ \begin{array}{@{}rcl@{}} \eta_{u,g} = \frac{1}{M_{u}} \underset{u^{\prime} \in g \wedge u^{\prime} \neq u}{\sum} S_{u,u^{\prime} } . \end{array} $$
(11)
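Given the precomputed matrix S (assumed here to be a NumPy array) and the array M of valid gaze point counts, Eq. 11 then reduces to a cheap lookup-and-sum, as in this sketch:

```python
def raw_score(u, group, S, M):
    """Raw score eta_{u,g} (Eq. 11) from the precomputed similarity matrix S."""
    others = [v for v in group if v != u]   # leave-one-out over the group
    return S[u, others].sum() / M[u]
```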

Average gaze map - exact form

The contribution of a gaze point \(\mathbf{r}_i\) to the average of the gaze map is

$$ \begin{array}{@{}rcl@{}} \mathcal{A}(\mathbf{r}_{i})&=&\frac{1}{L_{x}L_{y}L_{t}} \alpha(0, L_{x}, x_{i}, \nu_{x})\alpha(0, L_{y}, y_{i}, \nu_{y})\\&&\alpha(t_{W_{0}}, t_{W_{f}}, t_{i}, \nu_{t}) , \end{array} $$
(12)

where

$$ \begin{array}{@{}rcl@{}} \alpha(p, q,x_{0}, \nu) &=& \frac{ \sqrt{\pi} }{2\sqrt{\nu} } \left[ \mathit{erf}\left( \sqrt{\nu}(q-x_{0}) \right) \right.\\&&\left.- \mathit{erf}\left( \sqrt{\nu}(p-x_{0}) \right) \right] , \end{array} $$
(13)

and erf is the error function (Abramowitz, 1974).

The total contribution of the participant to the gaze map average is then given by the sum

$$ \begin{array}{@{}rcl@{}} A_{u} = \underset{i}{\sum} \mathcal{A}(\mathbf{r}_{i,u} ) \end{array} $$
(14)

where the sum goes over all valid gaze points of participant u.
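A sketch of Eqs. 12-14 using SciPy's error function (the box representation and names are illustrative assumptions):

```python
import numpy as np
from scipy.special import erf

def alpha(p, q, x0, nu):
    """Definite integral of exp(-nu (x - x0)^2) over [p, q] (Eq. 13)."""
    s = np.sqrt(nu)
    return 0.5 * np.sqrt(np.pi / nu) * (erf(s * (q - x0)) - erf(s * (p - x0)))

def avg_contribution(gaze_points, box, nu):
    """A[u]: the participant's total contribution to the gaze map average
    (Eqs. 12 and 14). box is a list of (p, q) limits, one per dimension."""
    V = np.prod([q - p for p, q in box])   # box volume (Eq. 8)
    total = 0.0
    for r in gaze_points:
        term = 1.0
        for (p, q), x0, n in zip(box, r, nu):
            term *= alpha(p, q, x0, n)     # one 1D integral per dimension
        total += term
    return total / V
```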

The average of the gaze map of participant u with respect to reference group g is then given by

$$ \begin{array}{@{}rcl@{}} \bar{F}_{u,g} = \underset{u^{\prime} \in g \wedge u^{\prime} \neq u }{\sum} A_{u^{\prime}} . \end{array} $$
(15)

Standard deviation of gaze map - exact form

We now show the equations for the standard deviation of the gaze map, which requires the overlap matrix elements to be calculated. The overlap between two Gaussians centered at gaze points \(\mathbf{r}_i\) and \(\mathbf{r}_j\) is given by

$$ \begin{array}{@{}rcl@{}} {{\varTheta}}(\mathbf{r}_{i},\mathbf{r}_{j}) &=& R(p_{x},q_{x},x_{i},x_{j},\nu_{x})R(p_{y},q_{y},y_{i},y_{j},\nu_{y})\\&&R(p_{t},q_{t}, t_{i},t_{j}, \nu_{t}) , \end{array} $$
(16)

where

$$ \begin{array}{@{}rcl@{}} R(p,q,x_{1},x_{2}, \nu) &=& e^{-\nu x_{-}^{2}/2}\frac{1}{2}\sqrt{\frac{\pi}{2\nu}}\left( {erf}(\sqrt{2\nu}(q-x_{+}) ) \right.\\&&\left.- {erf}(\sqrt{2\nu}(p-x_{+}) ) \right) , \end{array} $$
(17)

and

$$ \begin{array}{@{}rcl@{}} x_{-}&=&x_{1}-x_{2} \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} x_{+}&=&\frac{x_{1}+x_{2}}{2} . \end{array} $$
(19)

We can also define an overlap matrix at the participant level, akin to the one shown previously for the Gaussian similarity. This one has as elements the sums of all the overlaps between the Gaussians corresponding to a pair of participants. We then have

$$ \begin{array}{@{}rcl@{}} \theta_{u,u^{\prime} } =\underset{i}{\sum}\underset{j}{\sum}{{\varTheta}}(\mathbf{r}_{i,u} ,\mathbf{r}_{j,u^{\prime}} ) , \end{array} $$
(20)

where the sum is done over all valid gaze points of those participants within the window.

The standard deviation of the gaze map can be obtained from the overlap matrix Eq. 20 and the average field Eq. 15. It is given by

$$ \begin{array}{@{}rcl@{}} Std(\theta,\bar{F}_{u, g} ) = \sqrt{ \frac{1}{V} \underset{u^{\prime} ,u^{\prime\prime} \in g \wedge u^{\prime} ,u^{\prime\prime} \neq u }{\sum} \theta_{u^{\prime},u^{\prime\prime} } - \bar{F}^{2}_{u,g } } , \end{array} $$
(21)

where V is the volume of the box given by Eq. 8.
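The following sketch mirrors Eqs. 17 and 21 (it assumes the participant-level overlap matrix theta has already been assembled from the 1D integrals, in the same pairwise fashion as S above):

```python
import numpy as np
from scipy.special import erf

def overlap_1d(p, q, x1, x2, nu):
    """One-dimensional Gaussian overlap integral R (Eq. 17)."""
    x_minus, x_plus = x1 - x2, 0.5 * (x1 + x2)
    s = np.sqrt(2.0 * nu)
    return (np.exp(-0.5 * nu * x_minus**2) * 0.5 * np.sqrt(np.pi / (2.0 * nu))
            * (erf(s * (q - x_plus)) - erf(s * (p - x_plus))))

def std_gaze_map(u, group, theta, F_bar, V):
    """Standard deviation of the gaze map (Eq. 21), using the precomputed
    participant-level overlap matrix theta and the average from Eq. 15."""
    others = [v for v in group if v != u]
    return np.sqrt(theta[np.ix_(others, others)].sum() / V - F_bar**2)
```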

Approximate forms

A drawback of the analytical method presented above is that it requires calling the error function multiple times for each matrix element: six error function calls per element for a 3D calculation, and four for a 2D one. It is possible to employ an approximation that avoids this if all the gaze points are centered far enough from the boundaries of the box.

Since Gaussians decay very rapidly to zero, if they are far from the screen boundaries the error function approaches its limit value

$$ \begin{array}{@{}rcl@{}} \underset{x \to \pm \infty }{\lim} \mathit{erf}\left( \sqrt{\nu}(x-x_{0}) \right) =\pm 1 . \end{array} $$
(22)

Substituting the limit of Eq. 22 into Eq. 13 gives

$$ \begin{array}{@{}rcl@{}} \alpha(p,q,x_{0}, \nu) \approx \frac{ \sqrt{\pi} }{\sqrt{\nu} }. \end{array} $$
(23)

Similarly, the overlap integral in Eq. 17 is then simplified to

$$ \begin{array}{@{}rcl@{}} R(p,q, x_{1},x_{2}, \nu) &\approx& \sqrt{\frac{\pi}{2\nu}} e^{-\frac{1}{2}\nu x_{-}^{2} } . \end{array} $$
(24)

One can see that this approximation results in a much simpler expression in which the error functions are entirely avoided, making the calculation faster, as will be demonstrated in the benchmark section. The algorithm remains exactly the same in every other aspect.
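In code the simplification is equally direct; compare this sketch of Eq. 24 with the exact overlap_1d above:

```python
import numpy as np

def overlap_1d_approx(x1, x2, nu):
    """Approximate overlap (Eq. 24): the erf factors are replaced by their
    limit values, valid when both Gaussians lie far from the box boundaries."""
    return np.sqrt(np.pi / (2.0 * nu)) * np.exp(-0.5 * nu * (x1 - x2)**2)
```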

However, caution must be taken when using the approximate form, for it could yield inaccurate results if the Gaussians are too close to the edge of the screen. In that case one possibility is to assume a screen size larger than the actual one so as to accommodate all the points well, which incurs no computational penalty for the analytical method, although it might affect the interpretation of the results. Our code allows the use of both the exact and approximate forms of the analytical calculation.

Algorithm complexity of the analytical NSS

The computational complexity for the analytical NSS is determined by the number of time points in each window and the number of participants, while being independent of the screen resolution.

The average map array A, from Eq. 14, is one-dimensional and has size \(n_u\). One can calculate \(\bar {F}_{u,g }\) by summing only the needed participants within the group. The calculation of the average field requires adding at most \(n_u\) terms, and therefore has complexity \( \mathcal {O}(n_{u} )\).

The matrix elements of the similarity and overlap matrices, S and θ, are the ones that require the largest computation time. Each of these matrices has \({n^{2}_{u}}\) elements. Any NSS score within the time window can be calculated for any group by adding up combinations of these matrix elements, so this process has complexity \( \mathcal {O}({n^{2}_{u}} )\). However, to pre-calculate all these matrix elements one needs to iterate over all possible pairs of participants and pairs of Gaussians. Since each participant has at most \(n_t\) gaze points within the window, the computational complexity becomes \( \mathcal {O}({{n}_{u}^{2}}{{n}_{t}^{2}} )\). It is therefore possible that the pre-calculation of the matrices will become the bottleneck of the entire process.

Algorithm 2

Theoretical comparison between grid and analytical methods

A comparison of the computational complexities between the grid and analytical methods is shown in Table 2. The main difference is that in the grid method there is nothing to pre-calculate for each window, for it sets up the binary mask and executes the FFT from scratch every iteration. On the other hand the analytical method makes a large initial investment in computation on each window to obtain the matrix elements, and the gain is that all the subsequent calculations within that window become very fast because they only require a retrieval and summation of these precomputed matrix elements. It becomes quite inexpensive to calculate the similarity score from any arbitrary combination of participants, with any inclusions, exclusions or repetitions needed, such as the leave-one-out procedure. This gain in speed opens the possibility of applying Monte Carlo techniques such as bootstrapping or permutation testing.

Table 2 Comparison of complexities between existing (grid-based) and new (analytical exact and approximate) methods on the calculation of a single window

Having a quadratic complexity might suggest that the new method could be slow. Nevertheless, the number of participants in many eye tracking studies is usually not very large (< 100), so this method can become faster than the grid-based calculation by orders of magnitude, as we demonstrate with the benchmarks made with our implementation.

The grid method is strongly affected by dimensionality due to the high number of points required, and extending to higher dimensions can be difficult. On the other hand, the analytical method scales much better with dimensionality and can be easily expanded to additional dimensions, requiring simply another Gaussian integral per dimension to calculate for each matrix element.

The number of time points per window is an important factor affecting speed for both methods. It can vary depending on the sampling rate, which, in the case of research-grade eye trackers, can be in the range of 30–2000 Hz. Since the motions of most interest for dynamic stimuli tend to be fixations and smooth pursuits rather than saccades, sampling rates above 100 Hz are often not needed. Video frame rates are typically \(\sim 30\) Hz, and there may be little need for many samples per frame, so temporal downsampling could be a good option to improve performance for both methods, as we later demonstrate.

The exact method also allows one to export the matrices for every time window, which give more information on individual participants, and can be useful for other analyses and data mining algorithms.

Monte Carlo

We now show how the overall algorithm can be generalized to include the application of Monte Carlo methods.

Monte Carlo simulations only imply extra repetitions of the within-window calculations with some variations in the data employed. Each of these variations is made simply by adding or removing participants from the calculation groups.

For example, a permutation test can be done by shuffling the participants into new groups on every Monte Carlo iteration and repeating the window calculation. A bootstrap calculation would instead require random sampling with replacement.
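As a schematic sketch of such a permutation test (the statistic callback is hypothetical; with the analytical method it would recompute, e.g., the within-group minus between-group NSS difference from the precomputed window arrays, so each iteration is cheap):

```python
import numpy as np

def permutation_test(statistic, labels, n_iter=1000, seed=0):
    """Permutation test over group labels (cf. Fig. 2): compare the observed
    statistic against a null distribution of group-shuffled statistics."""
    rng = np.random.default_rng(seed)
    observed = statistic(labels)
    null = np.array([statistic(rng.permutation(labels))
                     for _ in range(n_iter)])
    return observed, (null >= observed).mean()   # one-sided p value
```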

We can easily generalize the formulation of the calculation for Monte Carlo methods. The flowchart then becomes as shown in Fig. 2.

Fig. 2 Flowchart for the entire algorithm, generalized to include Monte Carlo

The theoretical advantage of the analytical NSS here becomes clear, because all the intensive computation is pre-calculated once per window, requiring minimal operations to obtain the results in subsequent iterations. However, random number generation is a process that has not been taken into account in the theoretical formalism, and it can slow down Monte Carlo calculations, so the actual performance must be measured in a speed test.

Speed performance benchmarks

We implemented a Fortran code that executes a grid-based calculation, as well as the new analytical method in both exact and approximate forms. A substantial effort was made to make the grid calculations efficient, including the implementation of our own convolution subroutine optimized for this problem. More details on our implementation can be found in Appendix B.

Here we provide a demonstration of the performance of all calculation methods. We used three sets of parameters similar to those of previous studies by Loschky et al. (2015), Dorr et al. (2010), and Marat et al. (2009), in order to make the execution times realistic for future studies in the field. We also created a fourth set of parameters that uses a low sampling rate of 30 Hz, the approximate video frame rate. The most relevant parameters for each of these sets are given in Table 3. We denote these parameter sets as A, B, C, and D.

Table 3 Sets of parameters used for the benchmark

The resolution marked with an asterisk (*) in Table 3 has been reduced by a factor of 4 in each dimension with respect to the original value. This spatial downsampling (binning) allows the grid calculation to run much faster, and is done merely as a reference for the speed performance, regardless of the accuracy. A comprehensive analysis of the accuracy of the grid approach is beyond the scope of the present work. Nevertheless, a brief test comparing the grid and analytical method scores is presented in Appendix E.

Synthetic gaze data were generated for each run of the benchmark individually, accounting for the number of simulated participants. The purpose of this test is to measure the speed of the methods and ensure correctness of the results, rather than obtaining physiologically accurate ones. See Appendix D for more details on the synthetic data.

Benchmarks were conducted for both 2D and 3D calculations, as well as the case of a Monte Carlo permutation test. In each run, a set of parameters was selected together with a quantity of participants, and then the calculations were done for five time windows. The (simulated) participants were split into two equal groups for each run. All the NSS scores obtained were exported to a file, so file writing times are taken into consideration (these could become more significant for the Monte Carlo calculations).

The reported times only consider the outermost loop depicted in Figs. 1 and 2, which corresponds to the calculation of the NSS scores over the time windows. We excluded the time for global setup operations (which include reading data, array allocations, and initializations), since they are only done once per run, regardless of the length of the calculation, and are not indicative of the algorithms’ performance. The reported times are the averages per window. All the results presented in the following sections are available in tabular form in the code repository.

2D calculation benchmark

The results for 2D calculations are given in Fig. 3.

Fig. 3 Average calculation time per window as a function of the number of participants, for the parameter sets given in Table 3, for all calculation methods

We observe that the grid calculation’s computation time exhibits linear behavior with respect to the number of participants. This is consistent with the complexity shown in Table 2, since the first term is the dominant one due to the high number of gridpoints, and quadratic effects are still not large.

On the other hand, the non-linear complexity of the analytical method is evident for all parameter sets. The approximate form can indeed provide a noticeable gain in speed.

The analytical approach is overall faster for parameter sets A and D, and considerably slower for C.

For set B, the performance is not too different when the number of participants is below 50. It is important to note that this parameter set uses spatial downsampling, without which the grid method would be much slower.

For set C we observe that the analytical method is very slow, even for a small number of participants. The reason is that the number of gaze samples per time window is very large, a consequence of the high gaze sampling rate, which greatly affects the computational complexity of the analytical approach. The grid method is resistant in this case because it is a 2D calculation.

In set D the computation is fast for all methods due to the small grid and the low number of time points per window, a consequence of the low gaze sampling rate. This shows the effectiveness of downsampling in both space and time. Temporal downsampling of the gaze can therefore be an effective tactic in conjunction with the analytical method.

3D calculation benchmark

We now show some speed tests of 3D calculations with the same parameter sets as before, except for C, because its grid is very large (Loschky et al., 2015, had similar parameters but performed only a 2D calculation). The standard deviation for the temporal dimension, \(\sigma_t\), was taken as 26.25 ms, based on the value used by Dorr et al. (2010). The results are shown in Fig. 4, and more details on the implementation can be found in Appendix B.

Fig. 4 3D calculation times. Average time per window for the 3D case for parameter sets A, B, and D

The results for parameter set D in two and three dimensions are shown in Table 4, and a plot of the values corresponding to the analytical calculations is given in Fig. 5 for ease of comparison. It is important to note that the increase in dimensionality does not cause an immense increase in computation time for the analytical NSS, unlike the grid method. The grid calculation speeds obtained in 3D are about 10 times slower than in 2D, while for the analytical calculations the slowdown is less than a factor of 2. Also, in the 3D case the measured speed of the analytical method is greater than that of the grid-based one by a factor ranging from about 30 to over 500, even though this grid is downsampled (otherwise the speed gain would likely be more than 16 times greater).

Fig. 5 Comparison 2D vs. 3D. Average time per window for the analytical calculations in 2D and 3D, using parameter set D. The values plotted are from Table 4

Table 4 Average times per window (in seconds) for 2D and 3D calculations for different numbers of participants nu and all methods, using parameter set D

Monte Carlo benchmark

To test the Monte Carlo capability of the code we conducted a permutation test of the participant grouping, with 1000 iterations for each window. The calculation in each iteration is the same as in the tests in “2D calculation benchmark”, with the difference that the groups are randomized on each iteration except the first. A flowchart of the Monte Carlo permutation test conducted is shown in Fig. 6.

Fig. 6 Flowchart for the Monte Carlo permutation test

One of the goals of this benchmark is to measure the speed of the code while generating random numbers, which are essential to all Monte Carlo computations. The group randomization is done using a Knuth shuffle, with the random numbers obtained from the well-known Mersenne Twister generator in the MKL library (Intel, 2017). The performance could depend on the choice of generator, and other options exist in MKL, although they were not tested here. Other Monte Carlo tests are possible (e.g., bootstrap), but their computational requirements should be very similar, so we expect these results to generalize to them as well.

The results for the Monte Carlo permutation test using the analytical method in exact form, with parameter set D for two-dimensional calculations, are shown in Fig. 7 (the test was not done for the grid calculations, since they are very slow and their expected duration is at least 1000 times longer than that of a single window). One can observe in the figure that the calculation with Monte Carlo takes longer, but the execution time does not increase by a factor of 1000; instead it remains low. We also know from the previous results in Fig. 5 that doing the calculation in 3D would still keep the execution time reasonable.

Fig. 7 Monte Carlo permutation test performance. Average time per window, with and without Monte Carlo, using parameter set D and the analytical-exact method in 2D

From the results obtained in Fig. 7, the estimated time for 200 participants for a one-minute video at 30 fps is about 12 hours, and even less if the approximate form is used. A grid calculation would instead take an estimated 2000 hours for the same task.

Discussion

The results are consistent with theoretical expectations and yield the efficiencies necessary for practical usage, as long as the number of participants or time points per window is not too large. The purpose of the present work is by no means to displace the grid method, which is clearly superior in some cases, but to establish a new option with other benefits. Our results show that if the number of participants or time points per window is very high, the analytical method is unfeasible unless some other optimization or simplification is employed.

The agreement between the scores obtained by the analytical and grid methods is very good. This can be seen in Appendix E, serving as evidence of the correctness of the presented analytical method and the proper functioning of the implemented code.

The results in Table 4 and Fig. 5 demonstrate that the analytical method is very scalable to higher dimensions, and potentially useful for other applications, unlike the grid method. If a higher-dimensional grid were used, the computation would slow down drastically. Suppose, for example, that a new dimension with 100 points were added: the calculation would be expected to slow down by a factor of more than 100. If instead two dimensions with that same number of points were added, the calculations would be over 10,000 times slower and would also require more memory, rapidly becoming unfeasible. Adding another dimension to the analytical calculations is likely to produce a much smaller slowdown, because every added dimension requires only two additional error function calls for each matrix element, and the number of matrix elements does not increase regardless of how many dimensions are present.

Our code was written to keep the flow of the analytical calculations as similar as possible for both exact and approximate forms. The error function call in the code is simply replaced with its limit value given by Eq. 22 in the approximate calculation. We therefore attribute the difference in speed between exact and approximate calculations, observed in Table 4 and Fig. 5, to the error function calls.

It might be possible to optimize the performance of the code even further for the approximate case. Aside from the error function, the exponential function call is likely to be one of the most time-consuming operations. By exploiting the properties of exponentials, one can turn a product of exponential functions into a single exponential applied to the algebraic sum of the exponents. In this way a single exponential call is needed regardless of how many dimensions exist. This optimization is not currently done in our implementation, for consistency reasons, in order to assess the impact of the error function usage.

The analytical-approximate form may encounter accuracy difficulties at the edges of the screen. Extending the border of the screen by a few standard deviations can prevent this problem, which is probably acceptable for many applications. If the stimuli of interest are not near the edges, and since gaze tends to be center-biased (Borji et al., 2013), it is likely that no adjustments will be needed. This is because any gaze points that are very far from the others will contribute to neither the Gaussian similarity nor the overlap matrix, regardless of how close to the border they may be.

The analytical method is substantially slowed when there are a large number of time points per window and when there are very large numbers of participants. However, the latter issue is greatly lessened if sampling rates are limited. For example, in a typical use case, researchers might want to analyze data from a large number of participants using commodity-level eye trackers. Not only are these systems limited to relatively slow sampling rates (30–90 Hz), but in many applications researchers are interested in only one gaze location per video frame. In such cases, the analytical method could accommodate hundreds of participants using a 33-ms window with 2–3 samples per window. The average time we obtain for 2000 participants with our code in such a scenario is about 47 s per window, and beyond that number the time increases very rapidly (the parameters and results for this calculation are available in the code repository).

Other improvement strategies

It may be possible to further improve the analytical method’s efficiency. Since Gaussians are highly localized distributions, many of the matrix elements of S and θ may be very close to zero for pairs of points that are very far apart, so these matrices could be treated as sparse, thereby cutting down the number of operations required.

Strategies such as spatial hashing (Hastings, Mesit, & Guha, 2005), employed in molecular dynamics simulations (Griebel, Knapek, & Zumbusch, 2007), could significantly reduce quadratic complexities. Some of those strategies are known to be highly effective for gases in which particles are relatively far apart, reducing the quadratic complexity to near linear (see, for example, Varga & Driscoll, 2011; Griebel et al., 2007, and references therein). Also, for some applications another possibility is to use masking (Lao et al., 2017), by only considering points within certain regions of the screen. In applications where the gaze points are not concentrated, these approaches could cut down the number of frames where full NSS calculations are needed.

Potential applications with higher dimensionality

The authors have identified some possible examples of high-dimensional applications where the method could be used.

One example could be data acquired while the participant is in motion, such as in driving environments (either virtual or real). According to Braunagel, Geisler, Rosenstiel, and Kasneci (2017), these cases are problematic for geometric methods and also for methods based on attention maps, and methods such as iMap (Caldara & Miellet, 2011) cannot be applied. The reason is that the regions of interest would need to be recalculated, taking into consideration the vehicle or driver’s motion (Braunagel et al., 2017). The analytical NSS could potentially be used in an application like this one, by including the vehicle or driver’s coordinates as additional spatial dimensions. The greatest gaze similarity is then achieved when the participants look at the same area from the same place.

Another example is the case of immersive videos (Löwe, Stengel, Förster, Grogorick, & Magnor, 2017). These have become possible in recent years due to the availability of 360-degree panoramic cameras, which allow participants to change the field of view by moving their heads. In doing so they may experience different stimuli, so there is a need to consider the viewing direction as well as the gaze location (Löwe et al., 2017). That additional variable prompted Löwe et al. (2017) to develop new methods of analysis beyond the typical ones. We propose extending the analytical NSS with an extra dimension for the viewing angle so it can be applied in this scenario. In such a case one must account for the periodic nature of the angular variable, which calls for a circular distribution; a Gaussian approximation might be sufficient in many cases (see Jammalamadaka & SenGupta, 1999, ch. 2, for more details), making the formalism presented here adequate.

In studies where participants can scroll or zoom the digital contents of the screen at will, it may happen that two of them are looking at the same feature while being exposed to different content. Such a scenario was encountered in a study on reading behaviors by Goodwin et al. (2019), and it resembles the challenge encountered by Löwe et al. (2017). The analytical NSS could also handle this scenario by adding more dimensions to account for the screen visibility, as well as other variables related to the participants’ reading behaviors.

The NSS similarity method does not make any assumptions specific to gaze data and is vector-based, so it may be applicable to other trajectory analysis problems as well. This method is also much simpler than others and easier to interpret, although more limited in scope. One important distinction to highlight is that this method was developed for measuring gaze similarity or hypothesis testing across groups or conditions (e.g., Dorr et al., 2010; Huff et al., 2017), while some of the other existing methods are for discovering trajectory patterns. Nevertheless, there are many common elements. For example, the Gaussian similarity has already been used by Lou et al. (2009) in a trajectory data mining algorithm to determine the probability that two trajectory points match. It is also present in the affinity matrices used in some clustering-based methods (e.g., Brox & Malik, 2010; Hu et al., 2007). The presented method is suitable for cases where few moving objects are present, no information is available about where the areas of interest might be, and one is interested in obtaining a measure of proximity. Also, this method gives each moving object an individual score indicating proximity, which can be used for other analyses.

Conclusions and remarks

We have presented a new method for calculating the NSS for gaze similarity measurements that yields the same results as the existing grid-based one, retaining all of its statistical properties, but with gains in accuracy and potentially large gains in speed. Not only do these improvements support new iterative analyses such as Monte Carlo approaches, but they also allow the extension of NSS analyses into settings requiring analysis of eye data for lengthy high definition videos because it maintains accuracy and speed regardless of screen resolution.

As theoretically expected, our implementation shows huge speed gains for the analytical calculation compared with the grid method in many, but not all, cases. The grid method remains a better choice if the number of participants or time points per window is large. However, strategies exist that can be used to counteract such obstacles for the analytical calculations.

The approximate form of the analytical NSS produced even larger gains in speed, although care must be taken if a large number of gaze points lie near the screen boundaries. As a precaution, we therefore proposed either padding each frame or tracking the number of gaze points within a certain number of standard deviations of the frame border when the approximation is employed.

The new analytical method presented here also shows very good dimensionality scaling, and it is efficient in either 2D or 3D. This can support new applications of the NSS not only for eye movement data, but possibly for more complex higher-dimensional applications where the grid method is completely unfeasible. We have identified some specific examples of possible applications, and many others may exist involving biological signals such as fMRI and EEG.

The analytical framework presented could be used to further investigate other variations of the NSS measure and make use of the support for high-dimensional data and Monte Carlo methods, in tune with modern trends.