Fusing imperfect experimental data for risk assessment of musculoskeletal disorders in construction using canonical polyadic decomposition

https://doi.org/10.1016/j.autcon.2020.103322

Highlights

  • Missing data is a common problem in data collection for WMSD risk assessment.

  • A data fusion method was developed to tackle this problem.

  • Canonical Polyadic Decomposition was applied in the method development.

  • Two real WMSD risk datasets were used to validate the developed method.

  • The method was found useful in handling missing data for reliable risk assessment.

Abstract

Field or laboratory data collected for work-related musculoskeletal disorder (WMSD) risk assessment in construction often become unreliable because large amounts of data go missing due to technology-induced errors, instrument failures, or sometimes at random. Missing data can adversely affect the assessment conclusions. This study proposes a method that applies Canonical Polyadic Decomposition (CPD), a tensor decomposition technique, to fuse multiple sparse risk-related datasets and fill in missing data by leveraging the correlation among multiple risk indicators within those datasets. Two knee WMSD risk-related datasets collected in previous studies, 3D knee rotation (kinematics) and electromyography (EMG) of five knee postural muscles, were used to validate and demonstrate the proposed method. The results revealed that, even with a large portion of missing values (40%), the proposed method can generate a fused dataset that provides reliable risk assessment results highly consistent (70%–87%) with those obtained from the original experimental datasets. This indicates that the proposed method is useful in WMSD risk assessment studies in which data collection is affected by a significant amount of missing data, and it will facilitate reliable assessment of WMSD risks among construction workers. In future work, the findings of this study will be used to explore whether, and to what extent, the fused dataset outperforms the datasets with missing values, by comparing the consistency of the risk assessment results obtained from these datasets.

Introduction

Work-related musculoskeletal disorders (WMSDs) are one of the most common causes of days away from work and physical disabilities in the construction industry [1]. Increased exposure to risk factors in the workplace raises the likelihood of WMSDs; hence, properly identifying possible risk exposures and developing injury prevention strategies are essential to alleviate WMSDs.

Collecting risk exposure data with human subject involvement is most often accepted as the gold standard for understanding risky behaviors and conditions that may expose workers to WMSD risks on construction sites [2]. Generally, these data are collected in laboratory settings or on real construction sites using technologies such as optical motion capture systems or surface electromyography sensors. However, data collected with these technologies often suffer from ‘drop out’, a phenomenon in which data are missing due to technology-induced errors (e.g., disconnection of sensors, errors in communicating with the database server, instrument failures), human-induced errors (e.g., accidental human omission) or other unknown reasons [3]. The resulting incompleteness of the collected risk exposure data may lead to invalid conclusions on the effects of the potential WMSD risk factors. Missing data is a common problem in technology-based data collection for ergonomic risk assessment, regardless of the quality of the research design [4], and should therefore be handled carefully. In doing so, it is necessary to preserve the interrelation among the potential risk factors and the risk indicators across multiple datasets. One benefit of data fusion is that it reveals the latent pattern of the data and, based on that pattern, leverages collaborative relationships among various factors within multiple datasets. This benefit can be used to preserve the interrelation among different factors and the risk indicators across the datasets while filling in the missing data.

This study proposes a method for dealing with multiple imperfect and incomplete datasets by applying a Canonical Polyadic Decomposition (CPD) technique to treat the imperfect data for WMSD risk assessment. CPD decomposes the incomplete datasets based on the latent relationship among different risk factors and the risk indicators, then reconstructs a new dataset through fusion as a high-order tensor [5]. This newly reconstructed dataset is referred to as the fused dataset, which can then be used for assessing the risk of WMSDs. To validate the effectiveness of the CPD-based method in assessing WMSDs, two WMSD risk-related datasets collected from prior experimental studies (the original datasets) were intentionally modified to represent incomplete datasets. CPD was then applied to fuse them and reconstruct the fused datasets. The risk assessment results obtained using the fused datasets were then compared with those obtained using the original datasets to evaluate the performance of the fusion treatment.
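
The validation idea described above (hide part of a complete dataset, reconstruct it with CPD, and compare against the original) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic 10 × 8 × 6 tensor, the rank of 2, the hypothetical function names, and the EM-style alternating least squares fitting routine, which is one common way to fit a CP model to incomplete data and is not necessarily the algorithm used in the study.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product of B (J x R) and C (K x R) -> (J*K x R)."""
    R = B.shape[1]
    return (B[:, None, :] * C[None, :, :]).reshape(-1, R)

def reconstruct(A, B, C):
    """Rebuild a three-way tensor from its CP factor matrices."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def cp_complete(X, mask, rank, n_iter=500, seed=0):
    """Fill entries of X where mask is False using a rank-`rank` CP model.

    EM-style ALS (an illustrative choice): missing cells are imputed with the
    current model estimate, then each factor matrix is refit by least squares.
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    X_obs = np.where(mask, X, 0.0)  # safe even if missing cells hold NaN
    for _ in range(n_iter):
        # Impute missing cells with the current reconstruction.
        X_fill = np.where(mask, X_obs, reconstruct(A, B, C))
        # Refit each factor matrix from the matching unfolding of X_fill.
        A = X_fill.reshape(I, -1) @ np.linalg.pinv(khatri_rao(B, C)).T
        B = np.moveaxis(X_fill, 1, 0).reshape(J, -1) @ np.linalg.pinv(khatri_rao(A, C)).T
        C = np.moveaxis(X_fill, 2, 0).reshape(K, -1) @ np.linalg.pinv(khatri_rao(A, B)).T
    # Keep observed values untouched; only missing cells come from the model.
    return np.where(mask, X, reconstruct(A, B, C))

# Synthetic stand-in for a complete "original" dataset (hypothetical sizes).
rng = np.random.default_rng(42)
A0 = rng.standard_normal((10, 2))
B0 = rng.standard_normal((8, 2))
C0 = rng.standard_normal((6, 2))
X_true = reconstruct(A0, B0, C0)

# Remove roughly 40% of the entries at random to mimic dropout.
mask = rng.random(X_true.shape) > 0.4  # True = observed
X_missing = np.where(mask, X_true, np.nan)

X_filled = cp_complete(X_missing, mask, rank=2)

# Relative error on the entries that were hidden from the model.
rel_err = (np.linalg.norm((X_filled - X_true)[~mask])
           / np.linalg.norm(X_true[~mask]))
print(f"relative error on missing entries: {rel_err:.4f}")
```

In this sketch the comparison is made directly on the hidden entries; the study instead compares the risk assessment results derived from the fused and original datasets.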

Section snippets

Importance of research

Missing data is a common problem in research studies that involve human subjects and technologies for data collection. It can reduce the statistical power of a study and lead to erroneous conclusions [6]. To mitigate this issue, the sample size for data collection is typically increased. However, this is not always possible due to the research design or limitations in budget and human resources. It is not always feasible to regenerate the data by repeating the experiment, as it can be

Problem statement and research objective

Human-subject data collected to assess WMSD risks among construction workers can suffer from missing data points due to dropout in the data collection technology. However, the existing literature lacks an in-depth method for handling multiple imperfect datasets when assessing WMSD risks. Tensor decomposition-based data fusion is potentially useful in this regard. It may help understand the data distribution of each dataset and consider the correlation among risk indicators

Proposed method

Fig. 2 provides a schematic overview of the proposed method. First, multiple risk-related incomplete datasets captured in multiple experimental settings are represented as high-dimensional tensors. Then these tensors are fused by applying the CPD tensor decomposition. The CPD first decomposes these tensors into factor matrices that represent the latent structures and collaborative relationships among all dimensions. Based on the correlation among different dimensions, the CPD then reconstructs
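
As a point of reference, the generic rank-R CP model for a third-order tensor, together with a least-squares objective restricted to the observed entries, can be written as follows. The symbols (the tensor X, factor matrices A, B, C, indicator weights w, and rank R) and the choice of a three-way tensor are illustrative assumptions, not the paper's own notation or its specific fusion scheme.

```latex
% Rank-R CP model: the tensor is approximated by a sum of R rank-one terms
\mathcal{X} \approx \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,
\qquad
x_{ijk} \approx \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}

% Fitting with missing data: only observed entries contribute to the loss
\min_{A,\,B,\,C} \; \sum_{i,j,k} w_{ijk}
\Bigl( x_{ijk} - \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr} \Bigr)^{2},
\qquad
w_{ijk} =
\begin{cases}
1 & \text{if } x_{ijk} \text{ is observed} \\
0 & \text{if } x_{ijk} \text{ is missing}
\end{cases}
```

Once the factor matrices are estimated, a missing entry is filled with the model value, the sum over r of a_{ir} b_{jr} c_{kr}, which is how the correlation across dimensions carries information into the gaps.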

Original datasets collected from previous experiments

The current study considered two risk-related datasets that were collected from the authors' prior human subject laboratory experimental studies. These studies assessed work-related factors for knee WMSDs among residential construction roofers who work on sloped environments. One dataset contains calculated knee rotation (kinematics) data, representing five knee rotational angles – flexion, abduction, adduction, internal and external rotation [42]. The second dataset contains EMG data,

Discussion and study limitations

To ensure reliable risk assessment of WMSDs, proper data collection is crucial. However, human-subject data can have missing data points due to dropout in the data collection technology. Moreover, risks sometimes cannot be fully quantified with a single risk indicator, and thus multiple heterogeneous risk indicators are often collected for risk assessment. As a result, a viable method is needed that preserves the interrelation among multiple risk indicators and potential risk factors

Conclusion and future extension

The current research proposed a method that applies the CPD tensor decomposition technique to fuse multiple imperfect and incomplete datasets and to replace missing data for assessing WMSD risks among construction workers. The proposed method not only replaces missing values but also preserves the correlation among the potential risk factors and the risk indicators during replacement. The method was validated by comparing the risk assessment results obtained from the fused

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors acknowledge the support of the National Institute for Occupational Safety and Health (NIOSH), which funded this research. The findings and conclusions in this research are those of the authors and do not necessarily represent the opinion of the National Institute for Occupational Safety and Health, Centers for Disease Control and Prevention.

References (54)

  • BLS, Nonfatal occupational injuries and illnesses: cases with days away from work,...
  • G. David

    Ergonomic methods for assessing exposure to risk factors for work-related musculoskeletal disorders

    Occup. Med.

    (2005)
  • M.C. Data

    Secondary Analysis of Electronic Health Records

    (2016)
  • W. Young et al.

    A survey of methodologies for the treatment of missing values within datasets: limitations and benefits

    Theor. Issues Ergon. Sci.

    (2011)
  • S. Rabanser et al.

    Introduction to tensor decompositions and their applications in machine learning

  • J.Y. Lin et al.

    How to avoid missing data and the problems they pose: design considerations

    Shanghai Arch. Psychiatry

    (2012)
  • H. Kang

    The prevention and handling of the missing data

    Korean J. Anesthesiol.

    (2013)
  • K.S. Button et al.

    Power failure: why small sample size undermines the reliability of neuroscience

    Nat. Rev. Neurosci.

    (2013)
  • N.K. Malhotra

    Analyzing marketing research data with incomplete information on the dependent variable

    J. Mark. Res.

    (1987)
  • R.M. Hamer et al.

    Last observation carried forward versus mixed models in the analysis of psychiatric clinical trials

    Am. J. Psychiatr.

    (2009)
  • A. Gelman et al.

    Using conditional distributions for missing-data imputation

    Stat. Sci.

    (2001)
  • D. Lahat et al.

    Multimodal data fusion: an overview of methods, challenges, and prospects

    Proc. IEEE

    (2015)
  • J. Luengo et al.

    Missing data imputation for fuzzy rule-based classification systems

    Soft. Comput.

    (2012)
  • R. Winkler et al.

    Fuzzy c-means in high dimensional spaces

    Int. J. Fuzzy Syst. Appl.

    (2011)
  • F. Castanedo

    A review of data fusion techniques

    Sci. World J. 2013

    (2013)
  • S.N. Razavi et al.

    Reliability-based hybrid data fusion method for adaptive location estimation in construction

    J. Comput. Civ. Eng.

    (2011)
  • A. Shahi et al.

    Activity-based data fusion for automated progress tracking of construction projects
