Optimizing the maximum reported cluster size in the spatial scan statistic for survival data

Abstract

Background

The spatial scan statistic is a useful tool for cluster detection analysis in geographical disease surveillance. The method requires users to specify the maximum scanning window size or the maximum reported cluster size (MRCS), which is often set to 50% of the total population. It is important to optimize the maximum reported cluster size, while keeping the maximum scanning window size as large as 50% of the total population, to obtain valid and meaningful results.

Results

We developed a measure, a Gini coefficient, to optimize the maximum reported cluster size for the exponential-based spatial scan statistic. The simulation study showed that the proposed method mostly selected the optimal MRCS, similar to the true cluster size. The detection accuracy was higher for the best chosen MRCS than at the default setting. The application of the method to the Korea Community Health Survey data supported that the proposed method can optimize the MRCS in spatial cluster detection analysis for survival data.

Conclusions

Using the Gini coefficient in the exponential-based spatial scan statistic can be very helpful for reporting more refined and informative clusters for survival data.

Background

The spatial scan statistic is a useful and widely used tool for detecting spatial or space–time clusters in disease surveillance. The method has been developed for different types of data such as count [1], ordinal [2, 3], survival [4], continuous [5,6,7], and multinomial [8]. The software SaTScan™ [9], available for free, enhances the ease of access to this method for researchers.

The spatial scan statistic is formulated based on the likelihood ratio test statistic. A large number of scanning windows of various sizes across all locations are first constructed on the entire study area. Each scanning window is a candidate for the most likely cluster. In SaTScan™, circular or elliptical scanning windows are considered. The likelihood ratio test statistic is calculated for each window to compare the inside of the window with the outside. The scanning window with the maximum value of the likelihood ratio test statistic is defined as the most likely cluster. Secondary clusters with high test statistic values are also reported.

Cluster detection results can be sensitive to the maximum scanning window size (MSWS), as studied by Ribeiro and Costa [10]. In SaTScan™, users can specify the MSWS, which is set to 50% of the total population by default. A high MSWS and a high maximum reported cluster size (MRCS) could result in an excessively large cluster. Some researchers try different MSWS values to obtain seemingly good results without knowing the MRCS. Repeatedly performing spatial cluster detection analyses using different values of MSWS leads to a multiple testing problem, as pointed out by Han et al. [11]. We can consider different values of MRCS with a fixed MSWS to avoid this problem. Still, we need to choose the optimal value of the MRCS. The clusters reported with a subjectively chosen MRCS may be different from the true clusters.

Han et al. [11] proposed a criterion measure to optimize the MRCS for the Poisson-based spatial scan statistic. They defined the Gini coefficient to represent the degree of heterogeneity of disease clusters for count data. Their simulation study showed that the Gini coefficient can be useful for selecting the best MRCS to obtain a refined collection of clusters. Interestingly, by reporting an optimized and refined collection of clusters rather than a single large cluster, the Gini coefficient allows us to better identify irregularly shaped ones [12].

As the formulation of the test statistic of the spatial scan statistic differs between models, the Gini coefficient should be clearly and distinctly defined for each model and thoroughly evaluated. The Gini coefficients for the ordinal- and normal-based spatial scan statistics were proposed by Kim and Jung [13] and by You and Jung [14], respectively. In this paper, we defined the Gini coefficient for the exponential-based spatial scan statistic, which is used for survival data. Through an extensive simulation study under various scenarios, we showed that the proposed method is very useful for optimizing the MRCS for the exponential-based spatial scan statistic. We illustrated the method using Community Health Survey data collected by the Korea Centers for Disease Control and Prevention.

Methods

Poisson model and the Gini coefficient

When we have count data such as the number of certain disease occurrences according to an underlying population at risk in a study region, we can use the Poisson-based spatial scan statistic [1]. We are often interested in identifying areas with high disease incidence rates. The null and alternative hypotheses are written as

$${H_0}:p = q\;{\text{for all}}\;z \in Z\;vs.\;{H_a}:p > q\;{\text{for some}}\;z \in Z$$

where p and q are the intensities of the outcome variable inside and outside the scanning window \(z\), respectively, and Z denotes the collection of all scanning windows. The likelihood ratio test statistic given window \(z\) is expressed as

$$LR\left(z\right)=\frac{{\left(\frac{{c}_{z}}{{n}_{z}}\right)}^{{c}_{z}}{\left(\frac{C-{c}_{z}}{N-{n}_{z}}\right)}^{C-{c}_{z}}}{{\left(\frac{C}{N}\right)}^{C}}$$

if \({c}_{z}/{n}_{z}>(C-{c}_{z})/(N-{n}_{z})\), and \(LR\left(z\right)=1\) otherwise. In the above equation, \({c}_{z}\) and \({n}_{z}\) denote the observed number of cases and the population within window \(z\), respectively, and \(C\) and \(N\) denote the total number of cases and the total population in the whole study area, respectively.
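As a minimal sketch of the statistic above, the following Python function evaluates \(\log LR(z)\) for a single window. The function name and arguments are illustrative and do not come from SaTScan™; working on the log scale is an implementation choice made here to avoid numerical overflow for large counts.

```python
import math


def poisson_log_lr(c_z, n_z, C, N):
    """Log of LR(z) for one window under the Poisson model.

    c_z, n_z: cases and population inside the window.
    C, N:     total cases and total population in the study area.
    Returns 0.0 (that is, LR(z) = 1) when the window shows no excess rate.
    """
    if n_z <= 0 or n_z >= N:
        return 0.0  # degenerate window: nothing to compare
    rate_in = c_z / n_z
    rate_out = (C - c_z) / (N - n_z)
    if rate_in <= rate_out:
        return 0.0

    def xlogy(x, y):
        # Convention 0 * log(0) = 0, needed when a count is zero
        return 0.0 if x == 0 else x * math.log(y)

    return (xlogy(c_z, rate_in)
            + xlogy(C - c_z, rate_out)
            - xlogy(C, C / N))
```

For example, a window holding 30 of 100 cases but only 100 of 1,000 people gives a positive value, whereas a window whose rate equals the overall rate gives 0.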

The scanning window that maximizes the value of \(LR\left(z\right)\) is the most likely cluster. Statistical inference for the most likely cluster can be performed using Monte Carlo hypothesis testing. In addition, secondary clusters with high values of the likelihood ratio test statistic are often of interest. The p-values of the secondary clusters are typically obtained in the same manner, so that a secondary cluster is declared significant only if it can reject the null hypothesis on its own strength.
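The Monte Carlo step can be sketched as follows, assuming the maximum test statistic has already been computed for the observed data and for each dataset replicated under the null hypothesis. The function name is hypothetical; the rank-based formula is the standard Monte Carlo p-value.

```python
def monte_carlo_p_value(observed_stat, null_stats):
    """Rank-based Monte Carlo p-value.

    observed_stat: test statistic of the cluster being assessed.
    null_stats:    maximum test statistics from datasets simulated
                   under the null hypothesis (one value per replicate).
    """
    # Count replicates at least as extreme as the observed statistic
    exceed = sum(1 for s in null_stats if s >= observed_stat)
    # Adding 1 counts the observed dataset itself, so p is never 0
    return (1 + exceed) / (1 + len(null_stats))
```

With 999 replicates, the smallest attainable p-value is 0.001. Passing the statistic of a secondary cluster as `observed_stat` against the same replicate maxima corresponds to assessing that cluster on its own strength.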

When reporting the most likely and secondary clusters, the Gini coefficient can be used to find a more refined collection of non-overlapping clusters. In economics, the Gini coefficient was developed to indicate the degree of heterogeneity of wealth distribution [15]. As a summary measure of the Lorenz curve, the larger the Gini coefficient, the higher the heterogeneity in wealth. Han et al. [11] adopted the Gini coefficient in the spatial scan statistic for count data to measure the degree of heterogeneity in the spatial distribution of disease cases by defining the x-axis of the Lorenz curve as the cumulative proportion of the number of disease cases and the y-axis as the cumulative proportion of the population. Its value is calculated as twice the area between the Lorenz curve and the 45° line, which indicates that the number of cases is proportional to the population of each region. When there is only one significant cluster, the Lorenz curve is constructed as a line graph connecting the three points (0,0), (\({x}_{1},{y}_{1}\)), and (1,1), where \({x}_{1}\) and \({y}_{1}\) are the proportions of observed cases and population (expected cases) in the cluster. As more cases are concentrated in the cluster than expected, \({x}_{1}\) increases and the Lorenz curve moves farther away from the reference line. The Gini coefficient also increases. When we have K multiple clusters, the Lorenz curve connects K points between (0,0) and (1,1). The coordinates of each cluster \(({x}_{k},{y}_{k})\) are defined as \({x_k} = \left( {\frac{1}{C}} \right)\mathop \sum \nolimits_{j = 1}^k {c_j}\) and \({y_k} = \left( {\frac{1}{N}} \right)\mathop \sum \nolimits_{j = 1}^k {n_j}\) where \({c}_{j}\) and \({n}_{j}\) are the number of cases and population in the \(j\)-th cluster. The Gini coefficient can be calculated as \({\sum }_{k=1}^{K+1}({y}_{k}{x}_{k-1}-{y}_{k-1}{x}_{k})\) with \({x}_{0}={y}_{0}=0\) and \({x}_{K+1}={y}_{K+1}=1.\) The Gini coefficient values range from 0 to 1. 
Among several competing sets of clusters, we select for reporting the collection with the highest Gini coefficient value. More detailed information is given by Han et al. [11]. The Gini coefficient has been implemented in SaTScan™ for the Poisson and Bernoulli models.
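The Gini coefficient formula above can be sketched in a few lines of Python. The function name and input layout are assumptions made for illustration: each cluster is given as a pair (cases, population), listed in reporting order.

```python
def gini_from_clusters(clusters, total_cases, total_population):
    """Gini coefficient for a set of non-overlapping clusters.

    clusters: list of (cases, population) pairs, one per cluster.
    Uses G = sum_{k=1}^{K+1} (y_k * x_{k-1} - y_{k-1} * x_k)
    with (x_0, y_0) = (0, 0) and (x_{K+1}, y_{K+1}) = (1, 1).
    """
    xs, ys = [0.0], [0.0]
    cum_cases = cum_pop = 0.0
    for cases, pop in clusters:
        cum_cases += cases
        cum_pop += pop
        xs.append(cum_cases / total_cases)    # cumulative share of cases
        ys.append(cum_pop / total_population)  # cumulative share of population
    xs.append(1.0)
    ys.append(1.0)
    return sum(ys[k] * xs[k - 1] - ys[k - 1] * xs[k]
               for k in range(1, len(xs)))
```

For a single cluster the sum reduces to \({x}_{1}-{y}_{1}\): a cluster holding 30% of the cases and 10% of the population gives a Gini coefficient of 0.2.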

Spatial scan statistic for survival data

Different spatial scan statistics for survival data have been proposed based on different models, including Weibull and generalized life distributions [16, 17]. Huang et al. [4] proposed a spatial scan statistic for survival data based on an exponential model. We focused on the exponential model. The exponential-based spatial scan statistic has been used to examine geographic disparities in survival in cancer patients [18,19,20].

Suppose we have survival data for I subjects in a study area, such as time to death for cancer patients. Let \({T}_{i}\) and \({L}_{i}\) be the survival time and fixed censoring time for the \(i\)th subject, respectively. We assume that \({T}_{i}\) is exponentially distributed with a probability density function \(f\left( {{T_i}} \right) = \frac{1}{\theta }{e^{ - \frac{{{T_i}}}{\theta }}},\;\theta > 0.\) Parameter \(\theta\) represents the mean survival time. The observed time is \({t_i} = \min \left( {{T_i},{L_i}} \right).\) Let \({\delta _i}\) be the censoring indicator, that is, \({\delta _i} = 1\;{\text{if }}{T_i} \leqslant {L_i}\) and \({\delta _i} = 0\;{\text{if }}{T_i} > {L_i}.\) To identify clusters of short survival, the null and alternative hypotheses are written as:

$${H_0}:{\theta _{{\text{in}}}} = {\theta _{{\text{out}}}}\;{\text{for all}}\;z \in Z\;vs.\;{H_a}:{\theta _{{\text{in}}}} < {\theta _{{\text{out}}}}\;{\text{for some}}\;z \in Z$$

where \({\theta }_{\mathrm{i}\mathrm{n}}\) denotes the mean survival time for subjects within zone \(z\), and \({\theta }_{\mathrm{o}\mathrm{u}\mathrm{t}}\) is the mean survival time for subjects outside zone \(z\). The exponential-based spatial scan statistic is defined as

$$\mathrm{\lambda }=\frac{\underset{z}{\mathrm{max}}{\left(\frac{{r}_{\mathrm{i}\mathrm{n}}}{\sum _{i\in z}{t}_{i}}\right)}^{{r}_{\mathrm{i}\mathrm{n}}}{\left(\frac{{r}_{\mathrm{o}\mathrm{u}\mathrm{t}}}{\sum _{i\notin z}{t}_{i}}\right)}^{{r}_{\mathrm{o}\mathrm{u}\mathrm{t}}}}{{\left(\frac{R}{\sum _{i\in G}{t}_{i}}\right)}^{R}}$$

where \({r}_{\mathrm{i}\mathrm{n}}=\sum _{i\in z}{\delta }_{i}\) and \({r}_{\mathrm{o}\mathrm{u}\mathrm{t}}=\sum _{i\notin z}{\delta }_{i}\) are the numbers of non-censored subjects inside and outside zone \(z\), respectively. The total number of non-censored subjects in the entire study area \(G\) is denoted by \(R={r}_{\mathrm{i}\mathrm{n}}+{r}_{\mathrm{o}\mathrm{u}\mathrm{t}}.\) When there are no censored observations, \({r}_{\mathrm{i}\mathrm{n}}\) and \({r}_{\mathrm{o}\mathrm{u}\mathrm{t}}\) are replaced by the total numbers of subjects inside and outside zone \(z\), \({n}_{\mathrm{i}\mathrm{n}}\) and \({n}_{\mathrm{o}\mathrm{u}\mathrm{t}}\), respectively, and \(R\) is replaced by \(N={n}_{\mathrm{i}\mathrm{n}}+{n}_{\mathrm{o}\mathrm{u}\mathrm{t}}\) in the above test statistic.
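As a minimal sketch, the term inside the maximum of the statistic above can be evaluated for one zone as follows. The function name and argument layout are hypothetical. Restricting the result to zones whose estimated mean survival is shorter inside than outside is an assumption added here to match the one-sided alternative hypothesis, mirroring the condition used in the Poisson model.

```python
import math


def exponential_log_lr(times, events, in_zone):
    """Log likelihood ratio for one zone under the exponential model.

    times:   observed times t_i
    events:  censoring indicators delta_i (1 = event observed, 0 = censored)
    in_zone: booleans, True if subject i lies inside zone z
    """
    r_in = sum(d for d, z in zip(events, in_zone) if z)
    r_out = sum(d for d, z in zip(events, in_zone) if not z)
    t_in = sum(t for t, z in zip(times, in_zone) if z)
    t_out = sum(t for t, z in zip(times, in_zone) if not z)
    if t_in <= 0 or t_out <= 0:
        return 0.0  # degenerate zone: nothing to compare
    # Assumption: keep only short-survival zones, i.e. a higher
    # event rate per unit of observed time inside than outside.
    if r_in / t_in <= r_out / t_out:
        return 0.0

    def xlogy(x, y):
        # Convention 0 * log(0) = 0
        return 0.0 if x == 0 else x * math.log(y)

    R, T = r_in + r_out, t_in + t_out
    return (xlogy(r_in, r_in / t_in)
            + xlogy(r_out, r_out / t_out)
            - xlogy(R, R / T))
```

The scan statistic is then the maximum of this quantity over all candidate zones, compared on the log scale.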

When searching for clusters of short survival time using SaTScan™, users can specify the maximum size for z. The default setting is 50% of the total population. When the size of the most likely cluster is very large, one may want to know if smaller clusters that are statistically significant are contained in the large cluster. We can try different values for the maximum reported cluster size (MRCS), not the maximum scanning window size (MSWS). The MRCS is also set to 50% of the total population by default. It is not clear how to select the best MRCS for the exponential model. In the next section, we propose a Gini coefficient to optimize the MRCS for the exponential model.

Gini coefficient for exponential model

To measure the disproportion of survival in each area, the Lorenz curve can be defined using the number of subjects and the sum of survival times. We define the x-axis as the cumulative proportion of the number of non-censored subjects and the y-axis as the cumulative proportion of the sum of observed times. If there is only one significant cluster \({z}^{*},\) the Lorenz curve is constructed in the same way as that of the Poisson model. Specifically, the x- and y-coordinates of point P for the cluster are defined as:

$${x}_{1}=\frac{\sum _{i\in {z}^{*}}{\delta }_{i}}{\sum _{i\in G}{\delta }_{i}}\left(=\frac{{r}_{\mathrm{i}\mathrm{n}}}{R}\right)$$

and

$${y}_{1}=\frac{\sum _{i\in {z}^{*}}{t}_{i}}{\sum _{i\in G}{t}_{i}}.$$

Considering the maximum likelihood estimates for the parameter \(\theta\) of the exponential distribution under the null and alternative hypotheses, that is, \({\widehat{\theta }}_{0}=\sum _{i\in G}{t}_{i}/R\) and \({\widehat{\theta }}_{in}=\sum _{i\in z}{t}_{i}/{r}_{in}\) (the total observed time divided by the number of non-censored subjects), the cumulative proportion of the sum of the observed times would be proportional to the cumulative proportion of non-censored subjects in each region under the null hypothesis of no clusters. If there is a significant cluster \({z}^{*}\) of short survival, the proportion of the sum of observed times in the cluster to that in the whole study region \(G\) would be less than the proportion of the number of non-censored subjects. As the sum of the observed times in the cluster \({z^*}\) decreases, the y-coordinate \({y}_{1}\) decreases and the Lorenz curve moves farther away from the reference line. Then, the value of the Gini coefficient, which is twice the area between the Lorenz curve and the reference line, increases. When there are K clusters \(z_1^*,\; \ldots ,\;z_K^*\) (ordered by their statistical significance), the coordinates of each cluster \(({x}_{k},{y}_{k})\) are defined as \({x}_{k}=\sum _{i\in \left\{{\bigcup }_{j=1}^{k}{z}_{j}^{*}\right\}}{\delta }_{i}/R\) and \({y_k} = \mathop \sum \nolimits_{i \in \left\{{\bigcup }_{j=1}^{k} {z}_{j}^{*}\right\}} {t_i}/\mathop \sum \nolimits_{i \in G} {t_i}\). The Lorenz curve connects the K points \(({x}_{k},{y}_{k})\) between (0,0) and (1,1), and the Gini coefficient is calculated in the same way as \({\sum }_{k=1}^{K+1}({y}_{k}{x}_{k-1}-{y}_{k-1}{x}_{k})\) with \({x}_{0}={y}_{0}=0\) and \({x}_{K+1}={y}_{K+1}=1.\) Different values for the MRCS produce different sets of clusters with different values of the Gini coefficient. We can select the optimal collection of clusters with the highest dissimilarity in survival based on the Gini coefficient.
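The definition above can be sketched in Python as follows. This is an illustrative sketch, not the R function mentioned by the authors: the function names are hypothetical, and each cluster is assumed to be supplied as a set of subject indices, with the clusters non-overlapping, significant, and ordered by significance.

```python
def exponential_gini(times, events, clusters):
    """Gini coefficient for the exponential model.

    times:    observed times t_i for all subjects in the study region
    events:   censoring indicators delta_i (1 = non-censored)
    clusters: list of sets of subject indices, ordered by significance
    """
    R = sum(events)   # total number of non-censored subjects
    T = sum(times)    # total observed time
    xs, ys = [0.0], [0.0]
    covered = set()
    for cluster in clusters:
        covered |= set(cluster)  # union of the first k clusters
        xs.append(sum(events[i] for i in covered) / R)
        ys.append(sum(times[i] for i in covered) / T)
    xs.append(1.0)
    ys.append(1.0)
    return sum(ys[k] * xs[k - 1] - ys[k - 1] * xs[k]
               for k in range(1, len(xs)))


def select_mrcs(times, events, clusters_by_mrcs):
    """Return the MRCS whose reported clusters maximize the Gini coefficient.

    clusters_by_mrcs: dict mapping each candidate MRCS value to the
    list of clusters reported at that MRCS.
    """
    ginis = {m: exponential_gini(times, events, c)
             for m, c in clusters_by_mrcs.items()}
    best = max(ginis, key=ginis.get)  # ties go to the first candidate seen
    return best, ginis
```

With a single cluster, the value reduces to \({x}_{1}-{y}_{1}\), so a cluster that holds half of the non-censored subjects but only a small share of the total observed time yields a large coefficient. How ties between candidate MRCS values should be broken is not specified in the text; the sketch simply keeps the first maximum.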

Simulation study

We conducted a simulation study to evaluate the performance of the Gini coefficient in the exponential model. We used six cluster models in Seoul and Gyeonggi Province in South Korea as the whole study region. True clusters of different shapes and sizes are assumed in the study region, consisting of 67 districts, as shown in Fig. 1. Since circular and elliptical windows are available in SaTScan™, we mainly considered these two shapes. We also included an irregularly shaped cluster to examine whether the proposed method could possibly work better in identifying irregular clusters than the default setting. Cluster models A and B assumed a circular true cluster of 10% (6 districts) and 30% (20 districts) of the entire study region, respectively. Cluster model C included two adjacent circular clusters, each of which accounts for 10% (6 districts). Models D and E consisted of elliptical clusters of 10% (6 districts) and 30% (20 districts). Model F included an irregularly shaped cluster of 20% (13 districts). For each model, we considered 12 scenarios for the combination of mean survival time and censoring rate. We varied the mean survival time for the true clusters as 2, 5, and 7, compared to 10 for areas outside the clusters. We adopted the parameter setting for the mean survival time from the study by Huang et al. [4]. The censoring rates were set to 10%, 30%, 50%, and 70% to examine how the performance of the proposed method can be affected by the censoring rate.

Fig. 1

Cluster models used in the simulation. A one circular cluster of 10%, B one circular cluster of 30%, C two circular clusters of 10% each, D one elliptical cluster of 10%, E one elliptical cluster of 30%, F one irregular cluster of 20%

We generated 1,000 subjects and randomly assigned them to one of the 67 districts in the study region under each scenario. If a subject was in a district of the true cluster, the survival time was generated from an exponential distribution with a mean of 2, 5, or 7, depending on the scenario. Otherwise, the survival time was generated from an exponential distribution with a mean of 10. We censored the survival time for randomly selected subjects out of the 1,000 subjects at the chosen censoring rate. We then searched for clusters with short survival using circular and elliptical scanning windows, with 15 MRCS values of 3%, 4%, 5%, 6%, 8%, 10%, 12%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50% in the SaTScan™ software. Using these numbers can be thought of as a grid search. These candidate MRCS values are the ones used for the Poisson and Bernoulli models in SaTScan™, and we adopted them for the exponential model for consistency. The MSWS was fixed at 50%. The Gini coefficient was calculated for each MRCS value. We selected the optimal MRCS with the highest Gini coefficient. The reported clusters were then compared with the true clusters.
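The data-generating step can be sketched with the Python standard library as follows. The function and variable names are hypothetical. The text does not state how the observed time of a censored subject was set, so the sketch keeps the generated time and only sets the indicator to 0; that choice is an assumption, as is assigning subjects to districts with equal probability.

```python
import random

# Candidate MRCS values (percent of the population) listed in the text
MRCS_GRID = [3, 4, 5, 6, 8, 10, 12, 15, 20, 25, 30, 35, 40, 45, 50]


def simulate_dataset(n_subjects, n_districts, cluster_districts,
                     mean_in, mean_out, censoring_rate, seed=None):
    """Generate one simulated survival dataset.

    cluster_districts: set of district indices forming the true cluster.
    Returns (districts, times, events), one entry per subject.
    """
    rng = random.Random(seed)
    # Assumption: each subject is equally likely to fall in any district
    districts = [rng.randrange(n_districts) for _ in range(n_subjects)]
    # random.expovariate takes the rate, i.e. 1 / mean survival time
    times = [rng.expovariate(1.0 / (mean_in if d in cluster_districts
                                    else mean_out))
             for d in districts]
    # Assumption: censored subjects keep their generated time, delta_i = 0
    n_censored = round(censoring_rate * n_subjects)
    censored = set(rng.sample(range(n_subjects), n_censored))
    events = [0 if i in censored else 1 for i in range(n_subjects)]
    return districts, times, events
```

A scenario for cluster model A with a mean survival time of 5 and 30% censoring would correspond to a call such as `simulate_dataset(1000, 67, true_cluster, 5, 10, 0.3)`, where `true_cluster` is the set of six district indices. The cluster search itself is run in SaTScan™ for each value in `MRCS_GRID`.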

We repeated the simulation 1,000 times for each scenario. We counted the number of times the Gini coefficient selected each of the 15 MRCS values as the optimal one. The performance of the proposed method was summarized using the sensitivity and positive predictive value (PPV). In the context of spatial cluster detection, sensitivity is the proportion of districts correctly detected among the districts in the true cluster, and PPV is the proportion of districts correctly detected among the districts in the detected cluster. Higher values of these measures indicate more accurate detection. Specifically, the sensitivity and PPV were estimated from 1,000 datasets as

$${\text{Sensitivity}} = \frac{1}{S}\mathop \sum \limits_{s = 1}^S \frac{{\text{number of districts correctly detected}}}{{\text{number of districts in the true cluster}}}$$

$${\text{PPV}} = \frac{1}{S}\mathop \sum \limits_{s = 1}^S \frac{{\text{number of districts correctly detected}}}{{\text{number of detected districts}}}$$

where \(S\) is the number of datasets in which the null hypothesis was rejected. We also calculated the accuracy measures under the default MRCS setting of 50% in SaTScan™.
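The two accuracy measures can be sketched as follows, assuming the detected districts are available as one set per dataset in which the null hypothesis was rejected. The function name and input layout are hypothetical.

```python
def sensitivity_ppv(true_districts, detected_per_dataset):
    """Average sensitivity and PPV over the rejected datasets.

    true_districts:       set of districts in the true cluster(s)
    detected_per_dataset: list of non-empty sets of detected districts,
                          one per dataset where the null was rejected
    """
    S = len(detected_per_dataset)
    if S == 0:
        return float("nan"), float("nan")  # nothing was detected
    sens_sum = ppv_sum = 0.0
    for detected in detected_per_dataset:
        correct = len(true_districts & detected)
        sens_sum += correct / len(true_districts)  # share of truth found
        ppv_sum += correct / len(detected)         # share of detection correct
    return sens_sum / S, ppv_sum / S
```

A detected cluster that covers the true cluster but is twice its size has a sensitivity of 1 and a PPV of 0.5, which is the pattern expected when the reported cluster is too large.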

Korea community health survey data

To illustrate the proposed method, we used data from the Korea Community Health Survey (KCHS) conducted by the Korea Centers for Disease Control and Prevention [21]. This community-based cross-sectional survey has been conducted at 253 community health centres annually since 2008. The survey questionnaire includes topics related to health behaviour and prevention. We used the age of first drinking for males as the survival time in the 2017 survey data. If a person had never had a drink, his survival time was censored at his age at the time of the survey. The location information of each individual was available at the district level because each district in Korea has approximately one community health centre. In Seoul and Gyeonggi province, we searched for areas with low mean age of first drinking (i.e. spatial clusters of short survival time) using the exponential-based spatial scan statistic with both circular and elliptical scanning windows. The reported clusters selected optimally by the proposed method were compared with those at the default setting in SaTScan™.

Results

Simulation study results

Here, we present only a subset of the simulation results. The other results are included in Additional file 1. Tables 1 and 2 show that, when the true cluster was circular with a mean survival time of 5, the Gini coefficient most often selected an optimal MRCS equal to the size of the true cluster using circular or elliptical windows, regardless of the censoring rate. The detection accuracy was very high for the most frequently chosen MRCS. Both the sensitivity and PPV were above 0.95, which is higher than those at the default setting in most cases. The difference in the detection accuracy between the most often chosen MRCS and the default setting was larger when the true cluster was smaller (10%). The difference in PPV was even more pronounced. When the true cluster was medium sized (30%), the PPV was higher in every case at the most often chosen MRCS, while the sensitivity was slightly higher or similar. These results indicate that the spatial scan statistic without optimizing the MRCS tends to report a larger cluster than the true cluster, especially when the true cluster is small. A lower PPV implies that the detected cluster is larger because the number of detected districts is in the denominator when calculating the PPV. We also summarized the overall detection accuracy when using the Gini coefficient over all the chosen MRCSs. Still, the sensitivity and PPV were higher than or similar to those at the default setting.

Table 1 Simulation results for cluster model A (one circular cluster, 10% of total area) with a mean survival time of 5
Table 2 Simulation results for cluster model B (one circular cluster, 30% of total area) with a mean survival time of 5

In the case of two true clusters that are close to each other, the proposed method often chose a slightly smaller MRCS than the size of the true clusters. However, the PPV was always higher than that at the default setting, although the sensitivity was slightly lower only when the mean survival time in the true clusters was 5. This result again implies that the default setting reported a rather larger cluster than the true clusters. When the mean survival time was 7 in the true clusters, the frequency of the chosen MRCS was spread over all possible MRCSs (Table 3). This might be attributable to the low detection power due to the small difference in mean survival time inside vs. outside the clusters. The promising indication here is that the overall sensitivity is much higher when using the Gini coefficient than without it.

Table 3 Simulation results for cluster model C (two circular clusters, 10% each of total area) with a mean survival time of 7

In the case of elliptical clusters, the Gini coefficient with elliptical scanning windows most often picked the best MRCS of the same size as the true cluster when the mean survival time was 5 inside the true cluster (Tables 4 and 5). When the cluster was small (10%), the detection accuracy at the most often chosen MRCS was much higher than that at the default setting. When the mean survival time was 2 inside the true cluster, similar patterns were observed. The Gini coefficient with circular scanning windows most often selected a smaller MRCS than the true cluster size. Still, the overall sensitivity and PPV at the most often chosen MRCS were higher than those at the default setting. When the mean survival time was 7 inside the true cluster, the overall detection accuracy was higher than that at the default setting.

Table 4 Simulation results for cluster model D (one elliptical cluster, 10% of total area) with a mean survival time of 5
Table 5 Simulation results for cluster model E (one elliptical cluster, 30% of total area) with a mean survival time of 5

When the true cluster was irregularly shaped, the proposed method seemed to choose smaller sizes of MRCS than the true cluster size. However, the overall sensitivity was always higher than that at the default setting. When the mean survival time was 7 in the true cluster, the difference in performance was clearer (Table 6). This might be because refined sets of smaller clusters were reported by the Gini coefficient rather than a single larger cluster.

Table 6 Simulation results for cluster model F (one irregular cluster, 20% of total area) with a mean survival time of 7

KCHS data analysis results

When using circular windows, the proposed method selected the default setting of 50% as the optimal MRCS. The most likely cluster was quite large, including 31 districts, as shown in Fig. 2(a). A small secondary cluster consisting of three districts was also detected. When using elliptical windows, the proposed method selected 25% as the optimal MRCS. The detected clusters were slightly different from those at the default setting. Information on the detected clusters is presented in Table 7. A single large cluster consisting of 26 districts was detected at the default setting (Fig. 2(c)), while two smaller clusters were detected using the Gini coefficient (Fig. 2(b)). Cluster 1 in Fig. 2(b) is part of cluster 1 in Fig. 2(c). Some districts of cluster 2 in Fig. 2(b) overlapped with cluster 1 in Fig. 2(c), but the other districts were not included in the cluster in Fig. 2(c). The test statistic value for the cluster in Fig. 2(c) was much larger than that for cluster 1 in Fig. 2(b). However, the mean survival time of cluster 1 in Fig. 2(b) was lower than that of cluster 1 in Fig. 2(c). It is likely that the default setting detected a larger cluster by including unnecessary neighbouring districts. Although the mean survival time of cluster 2 in Fig. 2(b) was higher than that of cluster 1 in Fig. 2(c), it was still lower than that outside the clusters and was statistically significant. The clusters at the optimal MRCS chosen by the Gini coefficient in Fig. 2(b) appear to be more meaningful than cluster 1 in Fig. 2(c).

Fig. 2

Spatial clusters with low mean age of first drinking in Seoul and Gyeonggi province using 2017 KCHS data. a circular windows, Gini or default (50%), b elliptical windows, Gini (25%), c elliptical windows, default (50%)

Table 7 Cluster detection results for 2017 KCHS data using elliptical windows with the Gini coefficient and default setting for MRCS

Discussion and conclusion

We have proposed the Gini coefficient in the exponential-based spatial scan statistic to optimize the MRCS in cluster detection analysis for survival data. The proposed method was defined to measure the degree of heterogeneity in the mean survival times of clusters. Our simulation study showed that the Gini coefficient mostly selected the optimal MRCS, similar to the true cluster size. The detection accuracy was higher for the best chosen MRCS than at the default setting. A lower PPV at the default setting indicates that using the default value of 50% of the total population for the MSWS and MRCS tends to produce a larger cluster that hides smaller clusters and includes non-informative areas. Even though the Gini coefficient did not always select the optimal MRCS the same as the true cluster size, the overall detection accuracy when using the Gini coefficient was generally improved compared to when it was not used. This improvement was greatly noticeable in some cases.

The application of the proposed method to the KCHS data supported that the proposed method can optimize the MRCS in spatial cluster detection analysis for survival data. We searched for a cluster with a short survival time. The most likely cluster at the default setting was rather larger with a higher mean survival time than that at the optimal MRCS chosen by the Gini coefficient. Interestingly, the two clusters at the optimal MRCS were contiguous and formed an irregularly shaped cluster. As reported by Kim and Jung [12], the Gini coefficient might also be useful for detecting irregularly shaped clusters in the exponential model.

Here, we again emphasize that we optimize the MRCS using the Gini coefficient, not the MSWS. Rerunning the analyses with different MSWSs should be avoided because of the multiple testing problem. Wang et al. [22] presented their proposed method, called the maximum clustering heterogeneous set proportion, as an indicator to select the MSWS. As they described, different MSWSs lead to different sets of windows and then different detected clusters. Thus, even the same cluster under different sets of windows can have different p-values. It is incorrect to choose the result with the smallest p-value because it is not appropriately adjusted for multiple testing. Trying different values of MRCS to select clusters for reporting is the correct way to do this.

The Gini coefficient was first developed for the Poisson and Bernoulli models and subsequently adopted for the ordinal- and normal-based models. The Gini coefficient for the exponential model in this study was likewise defined specifically for its probability model and thoroughly evaluated. The option to optimize the MRCS using the Gini coefficient in SaTScan™ is available only for the Poisson and Bernoulli models. It is easy to implement the Gini coefficient in the exponential model using R with the ‘rsatscan’ package [23]. An R function to calculate the Gini coefficient is available upon request.

Using the spatial scan statistic with the default setting has been criticized because the detected most likely cluster may be much larger than the true clusters as they might include irrelevant neighbouring areas [24,25,26,27]. Studies that proposed the Gini coefficient for the Poisson, Bernoulli, ordinal, and normal models revealed that using the Gini coefficient in spatial scan statistics can resolve this problem to a certain extent [11, 13, 14]. Using the Gini coefficient for the Poisson model can also be effective in detecting irregularly shaped clusters [12]. The exponential model can be used for spatial cluster detection analysis of time-to-event type data such as cancer survival, time to disease recurrence, or age at first smoking, with or without censoring. We believe that using the Gini coefficient in the exponential-based spatial scan statistic can be very helpful for reporting more refined and informative clusters for survival data.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Abbreviations

MRCS:

Maximum reported cluster size

MSWS:

Maximum scanning window size

KCHS:

Korea Community Health Survey

References

  1. Kulldorff M. A spatial scan statistic. Commun Statistics Theory Meth. 1997;26:1481–96.

    Article  Google Scholar 

  2. Jung I, Kulldorff M, Klassen AC. A spatial scan statistic for ordinal data. Stat Med. 2007;26:1594–607.

    Article  Google Scholar 

  3. Jung I, Lee H. Spatial cluster detection for ordinal outcome data. Stat Med. 2012;31:4040–8.

    Article  Google Scholar 

  4. Huang L, Kulldorff M, Gregorio D. A spatial scan statistic for survival data. Biometrics. 2007;63:109–18.

    Article  Google Scholar 

  5. Kulldorff M, Huang L, Konty K. A scan statistic for continuous data based on the normal probability model. Int J Health Geogr. 2009;8:1.

    Article  Google Scholar 

  6. Huang L, Tiwari RC, Zou Z, et al. Weighted normal spatial scan statistic for heterogeneous population data. J Am Stat Assoc. 2009;104:886–98.

    Article  CAS  Google Scholar 

  7. Jung I, Cho HJ. A nonparametric spatial scan statistic for continuous data. Int J Health Geogr. 2015;14:30.

    Article  Google Scholar 

  8. Jung I, Kulldorff M, Richard OJ. A spatial scan statistic for multinomial data. Stat Med. 2010;29:1910–8.

    Article  Google Scholar 

  9. Kulldorff M. and Information Management Services, Inc. SaTScanTM v9.7: Software for the spatial and space-time scan statistics. https://www.satscan.org/, 2021.

  10. Ribeiro SHR, Costa MA. Optimal selection of the spatial scan parameters for cluster detection: a simulation study. Spatial Spatio Temporal Epidemiol. 2012;3:107–20.

    Article  Google Scholar 

  11. Han J, Zhu L, Kulldorff M, et al. Using Gini coefficient to determining optimal cluster reporting sizes for spatial scan statistics. Int J Health Geogr. 2016;15:27.

  12. Kim J, Jung I. Evaluation of the Gini coefficient in spatial scan statistics for detecting irregularly shaped clusters. PLoS ONE. 2017;12:e0170736.

  13. Kim S, Jung I. Optimizing the maximum reported cluster size in the spatial scan statistic for ordinal data. PLoS ONE. 2017;12:e0182234.

  14. You H, Jung I. Optimizing the maximum reported cluster size for normal-based spatial scan statistics. Commun Statistical Appls Methods. 2018;25:373–83.

  15. Gastwirth JL. The estimation of the Lorenz curve and Gini index. Rev Econ Stat. 1972;54:306–16.

  16. Bhatt V, Tiwari N. A spatial scan statistic for survival data based on Weibull distribution. Stat Med. 2014;33:1867–76.

  17. Bhatt V, Tiwari N. A spatial scan statistic for survival data based on generalized life distribution. Commun Statistics Theory Methods. 2016;45:5730–44.

  18. Huang L, Pickle LW, Stinchcomb D, et al. Detection of spatial clusters: Application to cancer survival as a continuous outcome. Epidemiology. 2007;18:73–87.

  19. Henry KA, Niu X, Boscoe FP. Geographic disparities in colorectal cancer survival. Int J Health Geogr. 2009;8:48.

  20. Lin Y, Schootman M, Zhan FB. Racial/ethnic, area socioeconomic, and geographic disparities of cervical cancer survival in Texas. Appl Geogr. 2015;56:21–8.

  21. Kang YW, Ko YS, Kim YJ, et al. Korea Community Health Survey Data Profiles. Osong Public Health Res Perspectives. 2015;6:211–7.

  22. Wang W, Zhang T, Yin F, et al. Using the maximum clustering heterogeneous set-proportion to select the maximum window size for the spatial scan statistic. Sci Rep. 2020;10:4900.

  23. Kleinman K. rsatscan: Tools, Classes, and Methods for Interfacing with SaTScan Stand-Alone Software. https://CRAN.R-project.org/package=rsatscan/. 2015.

  24. Tango T. A test for spatial disease clustering adjusted for multiple testing. Stat Med. 2000;19:191–204.

  25. Tango T, Takahashi K. A flexibly shaped spatial scan statistic for detecting clusters. Int J Health Geogr. 2005;4:11.

  26. Tango T. A spatial scan statistic with a restricted likelihood ratio. Japanese J Biometrics. 2008;29:75–95.

  27. Tango T. Spatial scan statistics can be dangerous. Stat Methods Med Res. 2021;30:75–86.

Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Contributions

IJ conceived the study and drafted the manuscript. SL and JM conducted the simulation and data analysis. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Inkyung Jung.

Ethics declarations

Ethics approval and consent to participate

This study was approved by SNU Research Ethics Team (IRB No. E1912/001–010).

Consent for publication

Not applicable.

Competing interests

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

  • Table A1. Simulation results for cluster model A (one circular cluster, 10% of total area) with a mean survival time of 2.

  • Table A2. Simulation results for cluster model A (one circular cluster, 10% of total area) with a mean survival time of 7.

  • Table A3. Simulation results for cluster model B (one circular cluster, 30% of total area) with a mean survival time of 2.

  • Table A4. Simulation results for cluster model B (one circular cluster, 30% of total area) with a mean survival time of 7.

  • Table A5. Simulation results for cluster model C (two circular clusters, 10% each of total area) with a mean survival time of 2.

  • Table A6. Simulation results for cluster model C (two circular clusters, 10% each of total area) with a mean survival time of 5.

  • Table A7. Simulation results for cluster model D (one elliptic cluster, 10% of total area) with a mean survival time of 2.

  • Table A8. Simulation results for cluster model D (one elliptic cluster, 10% of total area) with a mean survival time of 7.

  • Table A9. Simulation results for cluster model E (one elliptic cluster, 30% of total area) with a mean survival time of 2.

  • Table A10. Simulation results for cluster model E (one elliptic cluster, 30% of total area) with a mean survival time of 7.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Lee, S., Moon, J. & Jung, I. Optimizing the maximum reported cluster size in the spatial scan statistic for survival data. Int J Health Geogr 20, 33 (2021). https://doi.org/10.1186/s12942-021-00286-w

  • DOI: https://doi.org/10.1186/s12942-021-00286-w

Keywords