Research Paper
Role of municipal database in constructing site-specific multivariate probability distribution

https://doi.org/10.1016/j.compgeo.2020.103623

Abstract

The purpose of this paper is to investigate whether the quasi-site-specific model established based on a municipal or regional database can be more effective in supporting site-specific predictions than one based on a global database. To do this, a Bayesian updating approach that combines site-specific data and a municipal database under realistic MUSIC (Multivariate, Uncertain and Unique, Sparse, and InComplete) attributes for the site data is adopted. This approach assumes that the site data and the municipal database follow the same distribution, which can be partially verified using standard correlation plots. The quasi-site-specific model can then be adopted to make site-specific predictions for the relevant design properties. Because real soil data follow MUSIC attributes, this Bayesian updating is only feasible in the presence of a probability distribution construction method that can handle such data. A real case study for Shanghai shows that a municipal database is more effective in supporting site-specific predictions than a global database. In contrast, another real case study shows that a regional database that covers multiple countries in the Scandinavian region is not necessarily more effective than a global database.

Introduction

There are two distinctive features of geotechnical site investigation data that could be juxtaposed in the context of data-driven decision making. On the one hand, when it comes to a local site, site-specific data are typically sparse and incomplete. For a routine site investigation program, a few boreholes and some field tests such as cone penetration test (CPT) soundings may be conducted. For the boreholes, only a limited number of depths are measured or a limited number of samples are collected (sparsity in the vertical direction), whereas for both boreholes and CPT soundings, only a limited number of locations are sampled (sparsity in the horizontal direction). The volume of the investigated soil mass is very small compared with the total soil volume mobilized by the actual structure. Also, only a limited number of laboratory tests can be conducted on each soil sample, because the majority are destructive in nature. It follows that local correlations among different soil properties, e.g., Atterberg limits, SPT N, preconsolidation stress, and undrained shear strength, are difficult to establish, because this multivariate information is sparse and rarely complete (complete means that all soil properties are simultaneously measured in a particular location and at a particular depth). This is well known in practice. In fact, generic correlations founded on data from multiple global sites are common for this reason [17]. It is accurate to say that we are in a data-poor scenario for a local site. The definition of “poor” or “rich” is related to the statistical quantities that one is interested in evaluating, not some absolute number of samples as conventional wisdom may lead one to believe. For site characterization, we are interested in characterizing spatial variations (most commonly through the autocorrelation function) and dependency between different properties (most commonly through the cross-correlation). 
We describe our geotechnical site investigation as data “poor” at one site, because data are typically insufficient to characterize these statistical quantities with confidence. Phoon [20] described this situation as MUSIC: site-specific data are Multivariate, Uncertain and Unique, Sparse, and InComplete. Geotechnical engineers routinely need to make decisions based on MUSIC site-specific data. It is useful to point out that no existing method can address MUSIC site-specific data in full. To our knowledge, the probability distribution construction method proposed by Ching and Phoon [9] that can handle all the attributes in MUSIC is the first of its kind in geotechnical engineering. An example of a decision could be the selection of the characteristic value of a design property such as the undrained shear strength. The average value is not sufficient, because it is not a “cautious” estimate. This estimate should be a reasonable lower bound value that depends on the degree of uncertainty in the design property within the mobilized volume of soil relevant to a limit state. The standard indicator of uncertainty is the coefficient of variation which cannot be estimated with any confidence when the sample size is small. Phoon et al. [22] venture to suggest that the acronym MUSIC can be further extended to cover corrupted data (outliers): Multivariate, Uncertain and Unique, Sparse, Incomplete, and potentially Corrupted. Ching et al. [11] proposed a simple chi-square approach to identify outliers in a rock property database. To our knowledge, there is no method that can handle MUSIC data in this expanded form.

On the other hand, it is widely known that generic (non-site-specific) data are abundant. Phoon et al. [22] coined the phrase “Big Indirect Data” (BID) to emphasize that the common perception of data sparsity in geotechnical engineering is only accurate within a site-specific context. Indirect data arising from sites outside of the project boundary can range from irrelevant to relevant, but one can imagine abundance of the order of tens of thousands of soil records at a municipal/regional scale. The municipal database for Shanghai discussed below is an example of BID. It is quite extreme to ignore BID – the implied assumption that all non-site-specific data are irrelevant is not aligned to current practice. An experienced engineer will consider data from comparable sites but he/she is unlikely to find time to trawl tens of thousands of potentially useful soil records systematically. Hence, comparable sites are mostly restricted to those within his/her experience base likely to be restricted to a few municipalities/regions. Ultimately, if data were to continue to grow in volume and complexity, a pure human judgment approach is not a winnable strategy. The “site challenge” is how to complement current practice steeped in empiricism with data-driven methods to extract maximum value from BID for decision making. To put this “site challenge” in a simpler way, can one replicate the experience base of a human engineer by training an algorithm with all datasets worldwide (and subject it to re-training as new data emerge) that is sensitive to the unique features of each site for each property of interest? While engineering judgment remains pivotal in decision making, it is ineffective in dealing with the volume, variety, velocity, and veracity typically associated with big data, much less the complex inter-relationships between BID and MUSIC. 
The Bayesian updating approach adopted in the current paper is an example of combining BID (municipal database for Shanghai) and MUSIC (single site data) to provide more insights to inform judgment, but other strategies exist that can bring more value to site data than conventional regression analyses [21]. There is no doubt that human judgment can be significantly enhanced by such a data-driven approach that will contribute to the digitalization of geotechnical engineering.

The conventional approach for estimating a soil property from a measurement (say undrained shear strength from cone tip resistance) is regression. The underlying model is a bivariate probability distribution model. It is possible to estimate a soil property using multiple measurements (say undrained shear strength from cone tip resistance, overconsolidation ratio, and liquidity index). The supporting model would be a multivariate probability distribution model. The current practice is to characterize this probability distribution model using a global database (BID). However, global data are indirect, so the resulting model may not be precise for a particular target site [8]. Specifically, its average trend may not be the same as the local trend. The more ideal practice is to characterize a local model, but given the typical data-poor site-specific scenario, it is technically difficult to quantify the statistical uncertainty, which can be significant and cannot be neglected. Recently, Ching and Phoon [9] proposed a method of constructing a quasi-site-specific multivariate probability distribution model by hybridizing a site-specific model (based on MUSIC data) with a global model (based on BID). Hybridization rather than Bayesian updating is adopted in Ching and Phoon [9] to combine the site-specific and global data because the authors hypothesized that these databases follow different mean vectors and covariance matrices. The physical rationale is that site-specific and global data usually exhibit different correlation trends.
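The link between regression and a probability distribution model can be made explicit in the simplest case. The following worked relation is an illustration only: it assumes that the (suitably transformed) target property Y and the vector of measurements X are jointly normal, and the symbols \mu, \sigma_Y^2, and C are introduced here and do not come from the paper.

```latex
\begin{bmatrix} Y \\ X \end{bmatrix}
\sim N\!\left(
\begin{bmatrix} \mu_Y \\ \mu_X \end{bmatrix},
\begin{bmatrix} \sigma_Y^2 & C_{YX} \\ C_{XY} & C_{XX} \end{bmatrix}
\right)
\quad\Longrightarrow\quad
\begin{aligned}
E[Y \mid X = x] &= \mu_Y + C_{YX}\, C_{XX}^{-1}\,(x - \mu_X), \\
\mathrm{Var}[Y \mid X = x] &= \sigma_Y^2 - C_{YX}\, C_{XX}^{-1}\, C_{XY}.
\end{aligned}
```

With a single measurement X, the conditional mean reduces to the familiar regression line with slope \rho\,\sigma_Y/\sigma_X, and the conditional variance reduces to \sigma_Y^2(1-\rho^2), where \rho is the correlation coefficient between Y and X. Each additional measurement enters through C_{XX} and C_{YX}, and under this assumption the conditional variance cannot increase when a measurement is added.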

The current paper considers a different scenario: the database is now a municipal or regional database rather than global. The scenarios covered by municipal, regional, and global databases are described as follows:

  1. A municipal database contains only data from sites located in the same town/city. The Shanghai municipal database presented later is an example. This database contains only data from Shanghai sites.

  2. A regional database contains only data from sites located in the same geographical region. A geographical region refers to a continuous area that contains multiple towns, cities, provinces, or even countries. The Scandinavian regional database presented later is an example. This database contains only data from Norway, Sweden, and Finland sites.

  3. A global database contains data from sites located in multiple geographical regions. The CLAY/10/7490 database presented later is an example. This database contains data from sites in 30 countries in six continents.

The word “local” refers to a single target site in any database. A “municipal-local” scenario refers to the scenario where the database is municipal (Shanghai in this study) and the target site of interest is one site in Shanghai. This scenario is not uncommon: design offices or government regulatory bodies may have municipal soil data accumulated from past projects. In this scenario, it may be reasonable to assume that the site-specific data and the municipal data follow the “same” mean vector and covariance matrix. This can be partially verified using, for instance, the su/σ′v versus σ′v/Pa correlation plots such as those shown in Fig. 1 (su = undrained shear strength, σ′v = vertical effective stress, and Pa = 101.3 kPa = one atmosphere pressure). It can be seen that only a database at the municipal scale (Fig. 1c) may satisfy the population homogeneity assumption. Such correlation plots are routinely prepared in practice and provide the engineer a separate verification of homogeneity besides relying on a qualitative description of the database, be it municipal, regional or global.
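A numerical counterpart to such a correlation plot can be sketched as follows. This is a minimal illustration and not the procedure used in this paper: it assumes a log-log linear trend between su/σ′v and σ′v/Pa, and the function names, the ±1.96 standard deviation band, and all data (synthetic random numbers) are hypothetical choices made for the example.

```python
import numpy as np

PA_KPA = 101.3  # one atmosphere pressure in kPa, as defined in the text


def normalize(su_kpa, sigma_v_kpa):
    """Return (sigma'_v / Pa, su / sigma'_v) for paired measurements."""
    su = np.asarray(su_kpa, dtype=float)
    sv = np.asarray(sigma_v_kpa, dtype=float)
    return sv / PA_KPA, su / sv


def fraction_within_band(db_x, db_y, site_x, site_y, z=1.96):
    """Fit ln(y) = a + b*ln(x) to the database points and return the
    fraction of site points whose residual is within +/- z residual
    standard deviations of that trend (assumed log-log linear)."""
    lx, ly = np.log(db_x), np.log(db_y)
    b, a = np.polyfit(lx, ly, 1)  # polyfit returns slope, then intercept
    resid_sd = np.std(ly - (a + b * lx), ddof=2)
    site_resid = np.log(site_y) - (a + b * np.log(site_x))
    return float(np.mean(np.abs(site_resid) <= z * resid_sd))


# Synthetic data only: a "database" and two hypothetical target sites,
# one drawn from the same trend and one with a shifted strength ratio.
rng = np.random.default_rng(0)
sv_db = rng.uniform(20.0, 300.0, size=500)
su_db = 0.3 * sv_db * np.exp(rng.normal(0.0, 0.2, size=500))
sv_site = rng.uniform(20.0, 300.0, size=15)
su_same = 0.3 * sv_site * np.exp(rng.normal(0.0, 0.2, size=15))
su_shift = 0.9 * sv_site * np.exp(rng.normal(0.0, 0.2, size=15))

db_xy = normalize(su_db, sv_db)
frac_same = fraction_within_band(*db_xy, *normalize(su_same, sv_site))
frac_shift = fraction_within_band(*db_xy, *normalize(su_shift, sv_site))
print(frac_same, frac_shift)
```

A site whose points fall largely inside the database band is at least not contradicted by the database on this pair of variables. Like the correlation plot itself, this is only a partial verification, since it examines a single pair of normalized quantities.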

Under this homogeneity assumption, the Bayesian updating method can be adopted to combine the site-specific data and municipal data to obtain the quasi-site-specific multivariate probability distribution model. In the first step of this Bayesian method, the municipal database is used to construct the municipal probability model, and this model is adopted as the prior model for the second step. It is worth noting that this step also guarantees the correlation matrix arising from the database to be positive definite in a rigorous way. Past attempts to construct a multivariate distribution for generic databases are stymied by non-positive definite correlation matrices. Thus far, strategies to fix this important theoretical error are ad-hoc (e.g. [6]). Beer et al. [1] developed an algorithm to construct a valid correlation matrix from one established from expert estimates. In the second step, the prior model is updated into the posterior model (namely, the quasi-site-specific model) by the site-specific data.
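The two-step logic can be illustrated with a textbook conjugate update. This sketch is not the construction method used in this paper, which has to handle MUSIC data: it assumes complete records, multivariate normality (for example after a suitable transformation), and a normal-inverse-Wishart prior on the mean vector and covariance matrix. The function names, the weak starting prior, and the synthetic data are hypothetical.

```python
import numpy as np


def niw_update(m0, kappa0, psi0, nu0, data):
    """Conjugate normal-inverse-Wishart update for complete normal data.

    Prior: mean | cov ~ N(m0, cov / kappa0), cov ~ IW(psi0, nu0).
    Returns the posterior hyperparameters (m, kappa, psi, nu).
    """
    m0 = np.asarray(m0, dtype=float)
    x = np.atleast_2d(np.asarray(data, dtype=float))
    n = x.shape[0]
    xbar = x.mean(axis=0)
    centered = x - xbar
    scatter = centered.T @ centered
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    m_n = (kappa0 * m0 + n * xbar) / kappa_n
    diff = (xbar - m0)[:, None]
    psi_n = psi0 + scatter + (kappa0 * n / kappa_n) * (diff @ diff.T)
    return m_n, kappa_n, psi_n, nu_n


def posterior_mean_cov(psi, nu):
    """Posterior mean of the covariance matrix (requires nu > d + 1)."""
    d = psi.shape[0]
    return psi / (nu - d - 1)


# Synthetic two-property example (all numbers are made up).
rng = np.random.default_rng(1)
true_cov = [[1.0, 0.6], [0.6, 1.0]]
municipal = rng.multivariate_normal([0.0, 0.0], true_cov, size=2000)
site = rng.multivariate_normal([0.0, 0.0], true_cov, size=5)

weak_prior = (np.zeros(2), 1e-3, np.eye(2), 4.0)
prior = niw_update(*weak_prior, municipal)   # step 1: municipal model
posterior = niw_update(*prior, site)         # step 2: quasi-site-specific
print(posterior[0])
print(posterior_mean_cov(posterior[2], posterior[3]))
```

In this sketch the posterior scale matrix is the sum of a positive definite prior scale matrix and positive semi-definite terms, so the resulting covariance estimate is positive definite by construction. Under the homogeneity assumption, updating with the municipal data and then with the site data gives the same result as a single update with the pooled data.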

The main purpose of the current paper is to investigate whether the quasi-site-specific model established based on a municipal or regional database can be more effective in supporting site-specific predictions than one based on a global database. For this purpose, one real case study for Shanghai is investigated, and the performance of the quasi-site-specific model for the municipal database is compared with that for a global database. The reasons why the Shanghai municipal database is more effective are explored. Another case study for the Scandinavian region is further investigated to explore whether a regional database can be more effective than a global database.

Section snippets

Municipal database for Shanghai

This study adopts a clay municipal database previously developed by Zhang et al. [23], [24] for 13 clay sites in Shanghai, China. The locations of the 13 sites are shown in Fig. 2a. Each site contains 1 to 3 boreholes. The clay is normally consolidated to lightly over-consolidated, with medium to high plasticity (plasticity index = 10–30), and with medium to high sensitivity (sensitivity = 1.5–8). More than 10 clay parameters are compiled in this Shanghai municipal database labeled as

Bayesian updating method

Because the current paper considers a scenario different from that considered in Ching and Phoon [9], the analysis method used in the current paper is also different.

Case study

The Shanghai case history is investigated. As mentioned earlier, the data from 12 Shanghai sites with 22 boreholes are adopted as the municipal database, whereas the remaining 1 Shanghai site is adopted as the target site. Steps 1 to 2 of the Bayesian updating method are demonstrated below.

Conclusion

This paper investigates the effectiveness of a municipal database in supporting site-specific predictions. The quasi-site-specific multivariate probability model is constructed by the target-site data supported by a municipal database. Based on a case study of Shanghai, it is found that the municipal clay database is more effective in supporting site-specific predictions than a global clay database (CLAY/10/7490) previously developed by the first and second authors. The effectiveness for the

CRediT authorship contribution statement

Jianye Ching: Conceptualization, Methodology, Software, Verification, Formal analysis, Writing - Original draft preparation, Visualization, Supervision, Funding acquisition. Kok-Kwang Phoon: Conceptualization, Writing - Review & Editing, Supervision. Zahle Khan: Software, Verification, Visualization. Dongming Zhang: Resources, Data curation, Visualization. Hongwei Huang: Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The first author would like to thank the gracious support from the Ministry of Science and Technology of Taiwan (106-2221-E-002-084-MY3 and 107-2221-E-002-053-MY3). The authors would like to thank the members of the TC304 Committee on Engineering Practice of Risk Assessment & Management of the International Society of Soil Mechanics and Geotechnical Engineering for developing the database 304dB (http://140.112.12.21/issmge/Database_2010.htm) used in this study and making it available for

References (24)

  • S. Liu et al. Multivariate correlation among resilient modulus and cone penetration test parameters of cohesive subgrade soils. Eng Geol (2016)
  • Beer M, Gong ZT, Diaz De La O FA, Kreinovich V. How accurate are expert estimations of correlation? In Proc. 2017 IEEE...
  • L. Bjerrum. Embankments on soft ground
  • Z.I. Botev et al. Ann Stat (2010)
  • CECS04-88. Technical Specification for Electrical Cone Penetration Test (China Committee for Engineering Construction...
  • J. Ching et al. Transformations and correlations among some parameters of clays – the global database. Can Geotech J (2014)
  • J. Ching et al. Correlations among some clay parameters–the multivariate distribution. Can Geotech J (2014)
  • J. Ching et al. Constructing multivariate distributions for soil parameters
  • J. Ching. What does the soil parameter estimated from a transformation model really mean? J GeoEng (2018)
  • J. Ching et al. Constructing site-specific probabilistic transformation model by Bayesian machine learning. ASCE J Eng Mech (2019)
  • J. Ching et al. Constructing a site-specific multivariate probability distribution using sparse, incomplete, and spatially variable (MUSIC-X) data. ASCE J Eng Mech (2020)
  • J. Ching et al. Multivariate probability distribution for some intact rock properties. Can Geotech J (2019)