Elsevier

Earth-Science Reviews

Volume 207, August 2020, 103225

Machine learning methods for landslide susceptibility studies: A comparative overview of algorithm performance

https://doi.org/10.1016/j.earscirev.2020.103225

Abstract

Landslides are one of the catastrophic natural hazards that occur in mountainous areas, leading to loss of life, damage to properties, and economic disruption. Landslide susceptibility models prepared in a Geographic Information System (GIS) integrated environment can be key for formulating disaster prevention measures and mitigating future risk. The accuracy and precision of susceptibility models are evolving rapidly from opinion-driven models and statistical learning toward increased use of machine learning techniques. Critical reviews on opinion-driven models and statistical learning in landslide susceptibility mapping have been published, but an overview of current machine learning models for landslide susceptibility studies, including background information on their operation, implementation, and performance, is currently lacking. Here, we present an overview of the most popular machine learning techniques available for landslide susceptibility studies. We find that only a handful of researchers use machine learning techniques in landslide susceptibility mapping studies. Therefore, we present the architecture of various Machine Learning (ML) algorithms in plain language, so as to be understandable to a broad range of geoscientists. Furthermore, a comprehensive study comparing the performance of various ML algorithms is absent from the current literature, making an assessment of comparative performance and predictive capabilities difficult. We therefore undertake an extensive analysis and comparison between different ML techniques using a case study from Algeria. We summarize and discuss the algorithms' accuracies, advantages and limitations using a range of evaluation criteria. We note that tree-based ensemble algorithms achieve excellent results compared to other machine learning algorithms and that the Random Forest algorithm offers robust performance for accurate landslide susceptibility mapping with only a small number of adjustments required before training the model.

Introduction

Landslides are a cascading geo-hazard that can have significant impacts on human lives and settlements worldwide, and are a driving force in landscape evolution (Fan et al., 2019). In recent years, the socioeconomic impacts of landslides have been exacerbated through global economic expansion, unplanned developmental activities, and aggravated climate change (Guzzetti et al., 2012; Guzzetti et al., 1999; Li et al., 2020a).

The processes that modulate the spatial and temporal occurrence of landslides include strong earthquakes, heavy precipitation, snowmelt, land-use changes, and other anthropogenic activities (Chen et al., 2020a; Dai et al., 2002; Dou et al., 2015c; Jaboyedoff et al., 2012; Kawagoe et al., 2010). Landslide hazards and their associated risk have been studied in detail, due to their destructive nature and socioeconomic impacts. In areas with considerable risk from landslides, susceptibility maps are a fundamental step toward hazard assessment and mitigation strategies (Dou et al., 2019c; Li et al., 2019; Van Westen et al., 2006). Such procedures are typically followed in landslide assessments and mitigation at regional or catchment scale. The use of a Geographic Information System (GIS) environment in landslide susceptibility map preparation is an effective method to identify and delineate landslide-prone areas in order to create a geospatial database of landslide occurrence, or ‘landslide inventory’. Using GIS data sources, geospatial properties of the landslide locations that may affect potential slope stability, known as Landslide Conditioning Factors (LCF), can be compiled into a database (e.g., slope angle, slope aspect, soil type, rainfall, topographic wetness, and lithology type). The LCF data can then be used to model the responses of other slopes in the study area in an attempt to predict future landslide occurrence.

Different terminologies have been applied to this method of mapping and modeling over time. The term ‘landslide susceptibility mapping’ has been used in earlier studies to specifically refer to the process of identifying and mapping sites of historic landslides. More recently, it has included trying to predict locations of future events through modeling approaches, an approach referred to as ‘landslide susceptibility modeling’. In this paper, we use the acronym LSM to refer to the entire process of mapping and modeling the susceptibility of slopes to future landslides. LSM is a key part of disaster management strategies, as it produces a map of probabilities of landslide occurrence in a geographical region. According to Brabb (1984), landslide susceptibility is defined as “the likelihood of a landslide occurring in a given area” based on the given topographical and environmental variables (i.e., LCF), and an LSM approach identifies areas in which landslides are likely to occur (Guzzetti et al., 1999).

Decision-makers and local agencies use LSM probability evaluation in order to partition the geographic surface into zones with different degrees of stability and instability. This process, known as ‘landslide risk zoning’, plays a significant role in the control, management, and counter-measures for mitigating the risks associated with known and potential future landslides. Using GIS in this approach can improve the handling of spatial data, provide better processing capabilities, and aid in the decision-making process.
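The zoning step described above can be sketched in a few lines. This is a minimal illustration only: the five zone labels and the equal-interval class breaks are assumptions made for the example, not values taken from this review, and the function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical class breaks and zone labels (assumed for illustration);
# real studies choose the number of classes and the breaks per case.
BREAKS = [0.2, 0.4, 0.6, 0.8]
LABELS = ["very low", "low", "moderate", "high", "very high"]


def susceptibility_zone(probability: float) -> str:
    """Map a modelled landslide probability in [0, 1] to a zone label."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # bisect_right counts how many breaks are <= probability,
    # which is the index of the zone the value falls into.
    return LABELS[bisect_right(BREAKS, probability)]


def zone_map(probabilities):
    """Apply the zoning to a 2-D grid (list of rows) of probabilities."""
    return [[susceptibility_zone(p) for p in row] for row in probabilities]
```

In practice the probability grid would come from a trained susceptibility model and the resulting labels would be written back to a raster layer in the GIS.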

Despite different statistical approaches, terminologies, and computation capability, LSM primarily aims at highlighting the spatial distribution of landslides based upon the following assumptions: (i) the past is the key to the future, implying that future events will likely happen under conditions similar to those of past events; and (ii) LCF affecting landslide occurrence are spatially linked and therefore can be used in predictive functions (Reichenbach et al., 2018). These predictive functions can therefore be implemented through the compilation of a landslide inventory and associated geospatial LCF data.

Consequently, different quantitative techniques and approaches have been developed for LSM. Broadly, there are four main types of LSM approach: physical-based models, opinion-driven (i.e. heuristic) models, statistical models and, more recently, machine learning (ML) models (Chang et al., 2019; Nguyen et al., 2019; Pham et al., 2019; Tien Bui et al., 2019; Li et al., 2019). Each of these individual approaches has been shown to have its own advantages and limitations (Bergstra et al., 2013; Khosravi et al., 2019). For instance, physical-based models, which involve detailed site characterization, currently deliver the highest prediction accuracy and are suited for local-area (i.e., sub-catchment) scale mapping and analysis. Such models require a detailed understanding of the landslide system derived from local surface and subsurface observations and monitoring systems, and are typically employed to provide early warning of impending slope failure (Piciullo et al., 2018; Whiteley et al., 2019). However, for large-scale analysis (i.e., watershed/basin scale through to county/provincial/country level), physical-based models require large amounts of detailed data to provide reliable results, which comes with excessive financial and computational cost. Therefore, physical-based models are currently not practical for large area risk zonation exercises. For this reason, knowledge-based models and statistical models, which are modulated by limited information on terrain and environmental variables, have dominated the arena of LSM over the past 40 years (Guzzetti et al., 2012). Opinion-driven models structure a model from limited information and afterward parameterize it by ranking and/or weighting the landslide conditioning factors based on expert opinion and expertise. This approach can be problematic as it can be hard to quantify or evaluate a result objectively. Statistical models, on the other hand, have benefited the most from the advancements in GIS in the last decade, and consequently a plethora of quantitative methods and techniques have been proposed and implemented successfully for modeling landslides that aid in understanding landslide patterns and their triggering mechanisms (Dou et al., 2019b; Thai et al., 2016). Since the early days of statistical predictive modeling, the progression in understanding landslide susceptibility has been astonishingly rapid. In the last two decades, many different landslide susceptibility models emerging from various statistical approaches have been employed in the ML environment to obtain accurate risk zonation maps.

The margin between statistical models and ML is a subject of debate (Bzdok et al., 2018). The association and differences between statistical and ML modeling approaches are not well explained in the landslide susceptibility literature, primarily because producing and delivering accurate LSM results is a higher priority for geoscientists and geo-researchers than defining and classifying algorithms. By definition, ML learns from data without banking on rules-based functions, whereas statistical modeling streamlines relationships between variables in the data by means of mathematical equations. Although in the past the two fields were considered exclusive (Fig. 1a), they have converged in recent times (Fig. 1b). A case in point is the use of Logistic Regression (LR) algorithms, initially a statistical model for solving binary classification problems. LR was borrowed by ML from the field of statistical models and is currently one of the most widely used ML algorithms. Similarly, Bootstrap (Kulesa et al., 2015) is a method used in statistical inference, but is also applied regularly in Random Forest (RF) algorithms. Nevertheless, ML emphasizes optimization and performance rather than inference, which is the primary concern of statistical models. Because ML is still nascent in LSM, there are few instructive resources aimed at those who are not experts in ML, and so it is indispensable and timely to prepare an overview of the different ML algorithms available and provide a comparison between the available learning algorithms for LSM.
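The two borrowed techniques mentioned above, logistic regression and the bootstrap, can each be sketched briefly. This is a minimal illustration, not the implementation used in the case study: the function names are hypothetical, the feature values and weights in the usage lines are invented, and the inputs are assumed to be standardized so that the exponential does not overflow.

```python
import math
import random


def logistic_probability(features, weights, intercept):
    """Logistic regression prediction: probability of the positive class
    (here, landslide presence) given conditioning-factor values."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def bootstrap_sample(records, rng):
    """Draw len(records) items with replacement: one bootstrap replicate.
    In a Random Forest, each tree is trained on one such replicate."""
    return [rng.choice(records) for _ in records]


if __name__ == "__main__":
    # Invented, standardized values for two conditioning factors.
    cell = [1.2, -0.4]
    print(logistic_probability(cell, weights=[0.8, 0.3], intercept=-0.5))

    rng = random.Random(42)  # fixed seed so the replicate is reproducible
    print(bootstrap_sample(["a", "b", "c", "d"], rng))
```

The logistic function returns a value in (0, 1) that can be read directly as a susceptibility score, while the bootstrap only resamples the training records and makes no prediction by itself.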

In the literature, several studies have implemented different ML algorithms for LSM (Camilo et al., 2017; Pham et al., 2019; Tien Bui et al., 2019). Machine learning has flourished in other fields of science since the 1990s, with major developments including the implementation of neural networks, development of boosting algorithms, and increased accessibility to internet-derived and digital data. Consequently, ML was first used in the field of landslides in the early to mid-2000s (Fig. 2). Logistic regression (LR) and Artificial Neural Network (NNET) algorithms were the earliest ML methods applied to LSM and have a total article count of 1587 and 746 respectively since 2000. In the search for more accurate LSM products, researchers have more recently used highly sophisticated algorithms such as Support Vector Machine (SVM), Decision Tree (DT) and Random Forest (RF) algorithms, with their popularity increasing from 2010 onward. There have been 342 publications using SVM, 247 using DT and 169 using RF techniques since 2000. Other ML algorithms are rarely applied in LSM (Fig. 2) for two likely reasons: (i) SVM, DT, and RF attain over 90% prediction accuracy, which can currently be seen as a realistic upper limit in LSM modeling (Chang et al., 2019; Dou et al., 2020a); and (ii) other ML models have been developed more recently, and have increased complexities in their applications which require advanced knowledge of ML processing to implement successfully.

The literature surrounding specific aspects of landslide hazard and landslide susceptibility has grown over the past three decades. Several benchmark publications presenting case studies, models, and reviews on the susceptibility mapping and modeling process have been identified. Among them, the 15 top-cited publications are listed in Fig. 3, with most of this literature published before 2010. Noticeably, only a handful of researchers are involved in investigating the complexity of susceptibility models using ML models. A few studies have reviewed progress in the wider area of modeling (e.g., Budimir et al., 2015; Reichenbach et al., 2018; Rossi et al., 2010). Based on the Web of Science (WoS) database, we identify the ten authors whose contributions make up the most substantial proportion of all published literature on ML for LSM studies (Fig. 4). These ten authors are responsible for approximately 30% of published LR studies, 47% of published NNET studies, 70% of published RF studies, 83% of published DT studies, and 86% of published SVM studies. Although SVM, RF, and DT are more recent additions to the range of ML models available for LSM, the article share of these researchers is significant. Among the top five authors, four are affiliated with institutes in Malaysia, Norway, Iran, and Vietnam, underscoring the sizeable population of articles from these nations.

Furthermore, Fig. 5 shows publications using ML in LSM by country, and displays the countries with the most publications. In the WoS database, ‘country’ refers to the location of the author's affiliation, rather than the location of LSM studies. Although it is not always the case, the author affiliation often reflects the study area. For example, China tops the list with the most publications across all kinds of ML for LSM, and also has one of the highest incidences of landslide occurrence in the world (Kirschbaum et al., 2010). On the other hand, the Netherlands, where one-third of the land lies below sea level, does not experience significant risk from landslide hazards, but is listed 16th in Fig. 5. This can be attributed to the research outcomes from graduates and researchers at the International Institute for Geo-Information Science and Earth Observation (ITC), University of Twente, a leading research institute in GIS applications for natural hazards. Nepal, which topped the list of countries with the highest percentage of landslide reports by Kirschbaum et al. (2010), is not found in the top 18 countries using ML for LSM studies. This suggests an absence of ML researcher affiliation within the country.

The top journals and their percentage share of publications using ML for LSM are shown in Fig. 6. Journals such as Environmental Earth Sciences (EES), Geomorphology (GEM), Landslides (LAN), Catena (CAT), and Geomatics, Natural Hazards and Risk (GNH) are common choices for studies using ML for LSM. The next most popular choices are Engineering Geology (ENG) and Natural Hazards (NAH). Studies using advanced ML techniques such as SVM, DT, and RF are found in the journals Remote Sensing (REM) and Science of the Total Environment (SCT). Regional works on LSM using earlier ML techniques such as LR and NNET are also popular in Arabian Journal of Geosciences (ARA) and Natural Hazards and Earth System Sciences (NHE).

Overall, 42% of the articles using LR techniques for LSM in the WoS database were published in the ten journals listed in Fig. 6a. In addition, 49% of publications using NNET and/or LSM methods (Fig. 6b), 48% using SVM and/or LSM (Fig. 6c), 47% using DT and/or LSM (Fig. 6d), and 49% using RF techniques (Fig. 6e) were published in these journals, all of which are relevant to the study of natural hazards and geomorphology. Nevertheless, no comprehensive reviews have been undertaken focusing exclusively on the use of ML in LSM in order to present the complexities, comparisons, challenges, and opportunities for the future. Hence, this review builds upon the aforementioned body of literature (Fig. 3).

The concepts and terminology surrounding ML and its applications in LSM can be unfamiliar to geoscientists and geomorphologists without computing and statistical backgrounds, and therefore, Section 2 is devoted to detailing the architecture of the most popular ML algorithms used for landslide susceptibility studies. The ML algorithms presented include Logistic Regression, Artificial Neural Network, Support Vector Machine, Decision Tree, Random Forest, Naïve Bayes, Quadratic Discriminant Analysis, K-Nearest Neighbors, and Gradient Boosting algorithms.

To date, no consensus about which ML algorithm is the ‘best’ suited for predicting landslide-prone areas has been identified (Dou et al., 2020a; Y. Li et al., 2020b; Sevgen et al., 2019). It has been postulated in many studies that the prediction accuracy of landslide modeling is influenced by not only the quality of data behind landslide inventories and landslide conditioning factors but also the fundamental quality of the ML algorithm used (Nhu et al., 2020; Yilmaz, 2009). Therefore, Section 3 assesses and compares the prediction capabilities of different ML algorithms for LSM approaches by considering a case study from Algeria.
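As a rough illustration of what comparing prediction capability involves, the sketch below computes two common scores from a set of test cells. It rests on assumptions: the text here does not specify which evaluation criteria the case study uses, so overall accuracy and the area under the ROC curve (AUC) are chosen only as familiar examples, and the function names and the 0.5 threshold are hypothetical.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of cells whose thresholded score matches the 0/1 label.
    The 0.5 threshold is an assumption made for this example."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen landslide cell
    receives a higher score than a randomly chosen non-landslide cell
    (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Here labels would be 1 for mapped landslide cells and 0 for non-landslide cells held out from training, and scores would be the susceptibility values each algorithm assigns to those cells, so that several algorithms can be ranked on the same test set.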

When advanced ML techniques are used, prediction results can attain accuracies in excess of 90% (e.g., Dou et al., 2019a). However, researchers are still aiming to develop and apply additional models to produce more accurate outputs. Section 4 focuses on discussing the performance of the ML models and presents the challenges, limitations and future opportunities for using ML methods in LSM. The concluding remarks from this review are presented in Section 5.

Section snippets

Machine learning model architecture

Machine learning techniques have proven to be a standard solution for addressing big-data spatial analytics where the extent of the theoretical knowledge of a problem is incomplete (Lary et al., 2016) and when statistical pre-assumptions are unreliable or not known (Dou et al., 2019a). Due to these factors, and combined with their robustness as one of the ideal techniques for solving non-linear geo-environmental issues, ML techniques are increasingly used in LSM. Using either regression or

Comparative analysis of Machine Learning Models in Landslide Susceptibility Studies

Wolpert (1996) introduced the ‘No free lunch’ (NFL) concept, which was summarized as “any two algorithms are equivalent when their performance is averaged across all possible problems”. The NFL concept applies to the current state of ML modeling in general and spatial prediction of landslides in particular as “no single or particular model can be depicted as the most suitable for all case scenarios”. This is because of the difficulty in assessing whether an implemented ML model provides a

Discussion, challenges, and future directions

Regional landslide susceptibility mapping is a hot topic, due to the constant risks posed in many parts of the world. It is a critical step in the prediction and mitigation of future landslide occurrence, but requires substantial resources and can be difficult to implement due to the non-linear characteristics of LSM datasets. Although various methodologies for producing landslide susceptibility maps have been developed, the prediction accuracy of these methods is still debated (Su et al., 2017

Concluding remarks

In this article, we have provided a summary of machine learning models used for landslide susceptibility modeling, including identifying the recent trends in the use of ML methods for LSM, and presenting the basic architecture of the most popular ML methods. Subsequently, we formulated a comprehensive framework for comparing and assessing machine learning models to identify areas susceptible to the occurrence of landslides. This was achieved by systematically passing different landslide

Glossary

Bayes' theorem – Also ‘Bayes' law’ or ‘Bayes' rule’; describes the probability of an event based on prior knowledge of conditions that might be related to the event.
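In its standard form, the theorem can be written as below. The symbols are chosen here for illustration: A could stand for landslide occurrence at a location and B for the observed conditioning factors.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0
```

Here P(A) is the prior probability of A, P(B | A) is the likelihood of observing B when A holds, and P(A | B) is the updated (posterior) probability.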

Bagging – See Bootstrap.

Black-box model – A common metaphor used in computer programming referring to a system for which we can only observe the inputs and outputs, but not the internal workings (see also White-box models).

Bootstrap - The bootstrap (‘bootstrap aggregating’ or simply ‘bagging’) method is a resampling technique used

Author contributions

DJ, AB, and YAP were responsible for coordinating with all co-authors. AB performed the analysis with contribution from DJ and YAP. AB, YAP, and DJ generated most of the figures, with input from all authors. AB, DJ, YA, JW, BTP, DTB, RA, BA contributed to writing and provided helpful discussions.

Declaration of Competing Interest

No conflict of interest exists.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (No. 41827808) and open fund (SKHL1903) from State Key Laboratory of Hydraulics and Mountain River Engineering, Sichuan University, JSPS Program, and CAS Pioneer Hundred Talents Program. The authors sincerely thank the Editor Shuhab Khan and the two reviewers for their constructive and detailed comments. Jim Whiteley publishes with the permission of the Executive Director, British Geological Survey (UKRI-NERC).

References (151)

  • Q. Guo et al.

    Support vector machines for predicting distribution of Sudden Oak Death in California

    Ecol. Model.

    (2005)
  • F. Guzzetti et al.

    Landslide hazard evaluation: a review of current techniques and their application in a multi-scale study, Central Italy

    Geomorphology

    (1999)
  • F. Guzzetti et al.

    Landslide inventory maps: New tools for an old problem

    Earth-Science Rev.

    (2012)
  • C. Li et al.

    Susceptibility of reservoir-induced landslides and strategies for increasing the slope stability in the Three Gorges Reservoir Area: Zigui Basin as an example

    Eng. Geol.

    (2019)
  • I. Kaastra et al.

    Designing a neural network for forecasting financial and economic time series

    Neurocomputing

    (1996)
  • K. Khosravi et al.

    A comparative assessment of decision trees algorithms for flash flood susceptibility modeling at Haraz watershed, northern Iran

    Sci. Total Environ.

    (2018)
  • K. Khosravi et al.

    A comparative assessment of flood susceptibility modeling using Multi-Criteria Decision-Making Analysis and Machine Learning Methods

    J. Hydrol.

    (2019)
  • D.J. Lary et al.

    Machine learning in geosciences and remote sensing

    Geosci. Front.

    (2016)
  • M.F. Møller

    A scaled conjugate gradient algorithm for fast supervised learning

    Neural Netw.

    (1993)
  • E. Alpaydin

    Introduction to Machine Learning

    (2009)
  • C. Ballabio et al.

    Support Vector Machines for Landslide Susceptibility Mapping: the Staffora River Basin Case Study, Italy

    Math. Geosci.

    (2012)
  • D. Basak et al.

    Support vector regression

    Neural Inf. Process. Rev.

    (2007)
  • J. Bergstra et al.

    Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures

  • E.E. Brabb

    Innovative approaches to landslide hazard mapping

  • L. Breiman

    Random forests

    Mach. Learn.

    (2001)
  • L. Breiman et al.

    Random Forests: Finding Quasars

  • J. Bröcker et al.

    Increasing the Reliability of Reliability Diagrams

    Weather Forecast.

    (2007)
  • M.E.A. Budimir et al.

    A systematic review of landslide probability mapping using logistic regression

    Landslides.

    (2015)
  • D. Bzdok et al.

    Statistics versus machine learning

    Nat. Publ. Gr.

    (2018)
  • D.C. Camilo et al.

    Handling high predictor dimensionality in slope-unit-based landslide susceptibility models through LASSO-penalized Generalized Linear Model

    Environ. Model. Softw.

    (2017)
  • F. Catani et al.

    Landslide susceptibility estimation by random forests technique: sensitivity and scaling issues

    Nat. Hazards Earth Syst. Sci.

    (2013)
  • K.-T. Chang et al.

    Evaluating scale effects of topographic variables in landslide susceptibility models using GIS-based machine learning techniques

    Sci. Rep.

    (2019)
  • W. Chen et al.

    GIS-based landslide susceptibility modelling: a comparative assessment of kernel logistic regression, Naïve-Bayes tree, and alternating decision tree models

    Geomat. Nat. Haz. Risk.

    (2017)
  • Y. Chen et al.

    Relationship between water content, shear deformation, and elastic wave velocity through unsaturated soil slope

    Bull. Eng. Geol. Environ.

    (2020)
  • J.S. Chen et al.

    A kNN based position prediction method for SNS places

  • V. Cherkassky et al.

    Selection of meta-parameters for support vector regression

  • W. Chettah et al.

    Investigation des propriétés minéralogiques et géomécaniques des terrains en mouvement dans la ville de Mila «Nord-Est d'Algérie». Sci. la terre l'univers

    (2009)
  • C.-M. Chu et al.

    Integrating Decision Tree and Spatial Cluster Analysis for Landslide Susceptibility Zonation

    World Acad. Sci. Eng. Technol.

    (2009)
  • M. Clerc et al.

    The particle swarm - explosion, stability, and convergence in a multidimensional complex space

    IEEE Trans. Evol. Comput.

    (2002)
  • A. Cobham

    The intrinsic computational difficulty of functions

  • P.-E. Coiffait

    Un bassin post-nappes dans son cadre structural: l'exemple du bassin de Constantine (Algérie Nord-Orientale)

    (1992)
  • C. Cortes et al.

    Support-vector networks

    Mach. Learn.

    (1995)
  • N. Cristianini et al.

    Support Vector Machines and Kernel Methods: The New Generation of Learning Machines

    Artif. Intell. Mag.

    (2002)
  • C.F. Dormann et al.

    Collinearity: a review of methods to deal with it and a simulation study evaluating their performance

    Ecography (Cop.).

    (2012)
  • J. Dou et al.

    Optimization of causative factors for landslide susceptibility evaluation using remote sensing and GIS data in parts of Niigata, Japan

    PLoS One

    (2015)
  • J. Dou et al.

    Shallow and Deep-Seated Landslide Differentiation using support Vector Machines: a Case Study of the Chuetsu Area, Japan

    Terr. Atmos. Ocean. Sci.

    (2015)
  • J. Dou et al.

    An integrated artificial neural network model for the landslide susceptibility assessment of Osado Island, Japan

    Nat. Hazards

    (2015)
  • J. Dou et al.

    Evaluating GIS-Based Multiple Statistical Models and Data Mining for Earthquake and Rainfall-Induced Landslide Susceptibility Using the LiDAR DEM

    Remote Sens.

    (2019)
  • J. Dou et al.

    Torrential rainfall-triggered shallow landslide characteristics and susceptibility assessment using ensemble data-driven models in the Dongjiang Reservoir Watershed, China

    Nat. Hazards

    (2019)
  • J. Dou et al.

    Improved landslide assessment using support vector machine with bagging, boosting, and stacking ensemble machine learning framework in a mountainous watershed, Japan

    Landslides

    (2020)
    1

    These authors contributed equally.
