Applied Soft Computing

Volume 111, November 2021, 107728

DanHAR: Dual Attention Network for multimodal human activity recognition using wearable sensors

https://doi.org/10.1016/j.asoc.2021.107728

Highlights

  • We propose a dual attention method, DanHAR, to improve the comprehensibility of multimodal sensing HAR.

  • DanHAR achieves state-of-the-art performance across multiple public HAR datasets.

  • A visual analysis of attention weights in both the spatial and temporal domains is provided.

  • The attention method can effectively aid ground truth annotation of sensor signals.

Abstract

In this paper, we present a new dual attention method called DanHAR, which blends channel and temporal attention on a residual network to improve feature representation for sensor-based HAR tasks. Specifically, channel attention plays a key role in deciding what to focus on, i.e., which sensor modalities, while temporal attention tells where to focus by locating the target activity within a long sensor sequence. Extensive experiments are conducted on four public HAR datasets as well as a weakly labeled HAR dataset. The results show that the dual attention mechanism is of central importance for many activity recognition tasks. We obtain relative improvements of 2.02%, 4.20%, 1.95%, 5.22% and 5.00% over regular ConvNets on the WISDM, UNIMIB SHAR, PAMAP2 and OPPORTUNITY datasets and the weakly labeled HAR dataset, respectively. DanHAR surpasses other state-of-the-art algorithms at negligible computational overhead. A visual analysis shows that the proposed attention captures the spatial–temporal dependencies of multimodal sensing data, amplifying the more important sensor modalities and timesteps during classification. The results are in good agreement with normal human intuition.

Introduction

The last few years have seen the success of ubiquitous sensing, which aims to extract knowledge from data acquired by pervasive sensors [1]. Sensor-based human activity recognition (HAR) has become a very active research field, playing a key part in a variety of applications such as sports, interactive gaming, smart homes and general-purpose monitoring systems [2]. In essence, multimodal HAR can be treated as a multivariate time series classification problem, in which a sliding window technique divides sensor signals into multiple segments from which discriminative features are extracted [3]. Each time window can then be recognized by conventional machine learning methods such as support vector machines (SVM) [4], k-nearest neighbors (KNN) and naive Bayes. Although these shallow learning methods have made considerable achievements in inferring activity details, they depend heavily on hand-crafted feature extraction requiring expert knowledge [5], [6], which is task or application dependent and cannot be directly adopted for other similar activity recognition tasks. In addition, shallow learning struggles to capture the salient characteristics of complex activities and usually involves a laborious or time-consuming process of choosing optimal features [7]. To tackle these challenges, research into automatic feature extraction without human intervention has become an active area, and multimodal HAR is undergoing a transition from shallow learning to deep learning [8].
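
To make the conventional pipeline concrete, the following is a minimal sketch of windowed hand-crafted features fed to an SVM, using scikit-learn. The feature set, window size and the randomly generated placeholder data are illustrative assumptions, not taken from the paper.

```python
# Sketch of the 'shallow learning' pipeline: statistical features per window + SVM.
import numpy as np
from sklearn.svm import SVC


def window_features(window: np.ndarray) -> np.ndarray:
    """window: (timesteps, sensor_channels) -> flat feature vector of simple statistics."""
    return np.concatenate([
        window.mean(axis=0),   # per-channel mean
        window.std(axis=0),    # per-channel standard deviation
        window.min(axis=0),    # per-channel minimum
        window.max(axis=0),    # per-channel maximum
    ])


# Placeholder data: 200 windows of a 3-axis accelerometer, one activity label per window.
windows = [np.random.randn(128, 3) for _ in range(200)]
labels = np.random.randint(0, 4, size=200)

X = np.stack([window_features(w) for w in windows])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:5]))
```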

In recent years, deep learning has become an important research trend in the HAR area, where layers are stacked to construct deep neural networks (DNNs) that achieve clear performance gains without the need for hand-crafted features [9]. In particular, convolutional neural networks (CNNs) have significantly pushed the state of the art (SOTA) of HAR tasks due to their rich representation power. With increasing model capacity, DNNs achieve higher performance but inevitably demand more strictly labeled data. One challenge for deep HAR is therefore to collect annotated or ‘ground truth labeled’ data [10]. Ground truth annotation is tedious and laborious, since one has to label every activity instance manually by skimming through the raw sensor sequence. Moreover, time series data recorded from embedded multimodal sensors such as accelerometers or gyroscopes is far more difficult to interpret than other sensor modalities such as cameras, so accurately segmenting and labeling a target activity within a long sensor sequence requires considerable human effort. On the whole, although DNN models can automatically extract optimal features for classification, they still require strictly annotated ground truth, which makes creating a high-quality benchmark HAR dataset in a supervised learning scenario costly in human effort.

Intuitively, it is much easier for an annotator who is recording sensor data to know whether an interesting activity occurs somewhere in a long sensor sequence than to mark its exact boundaries. If activity categories can be inferred from such coarse-grained labels, and the specific location of each labeled target activity can be determined knowing only which kinds of activities the long sequence contains, the burden of manual labeling can be greatly alleviated [11]. It therefore deserves further research whether we can directly recognize and locate a target activity from coarsely labeled sensor data. We tackle the above challenges from a different angle, namely attention, which has recently been extensively investigated across research fields such as computer vision [12] and natural language processing (NLP) [13]. Similar to human perception, attention selectively focuses on parts of the target areas to enhance interesting details while suppressing irrelevant and potentially confusing information. That is to say, attention can tell where to focus by improving the representation of the regions of interest.

To the best of our knowledge, the idea of attention has seldom been adopted in the HAR scenario. Recently, two attention methods, combined with a Gated Recurrent Unit (GRU) network [14] and a Long Short-Term Memory (LSTM) network [15] respectively, were proposed to capture the dependencies of sensing signals in both the spatial and temporal domains simultaneously. Furthermore, we have proposed two attention methods, a hard attention method [10] and a soft attention method [16], which can focus on the target activity within a long sequence. However, compared with the above attention-based GRU [14] or LSTM [15], the hard and soft attention methods fail to address the spatial–temporal dependencies of multimodal sensing signals. That is to say, they can only tell where to focus from the temporal information and miss the channel information, which is crucial in deciding what to focus on. In the computer vision field, convolution operations usually extract features by blending cross-channel and spatial information together, and a series of studies have incorporated channel attention into convolution blocks [17], [18], [19], [20], [21], [22], showing great potential for performance improvement.

Channel attention, however, has never been considered in HAR. Inspired by this idea, we aim to increase representation power and comprehensibility by incorporating channel attention into multimodal HAR research. In this paper, we for the first time propose a dual attention method, called DanHAR, for the multimodal HAR scenario, which blends channel and temporal attention on a CNN model. To increase feature extraction capacity, a residual network is also introduced as the backbone. We sequentially infer channel and temporal attention maps, which enables the network to learn what and where to attend in the spatial and temporal domains simultaneously. The proposed attention captures channel features and temporal dependencies of multimodal time series sensor signals, amplifying the more important sensor modalities and timesteps during classification. Extensive experiments are conducted to evaluate DanHAR on four public HAR datasets, WISDM [23], UNIMIB SHAR [24], PAMAP2 [25] and OPPORTUNITY [26], as well as the weakly labeled dataset. We show that exploiting both attentions is superior to using either alone. A visual analysis of attention weights is further provided to explore how attention focuses on multimodal time series sensor signals and to improve the model’s comprehensibility, indicating the advantage of blending channel and temporal attention for the typical challenges of multimodal HAR.
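
As a rough illustration of the idea, the sketch below implements a CBAM-style dual attention block for 1D sensor feature maps in PyTorch: a channel attention module that re-weights feature channels (what to focus on), followed by a temporal attention module that re-weights timesteps (where to focus). The layer sizes, reduction ratio and tensor shapes are illustrative assumptions and not the authors' exact DanHAR implementation.

```python
# Sketch of sequential channel + temporal attention on 1D sensor feature maps.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Decides 'what' to focus on: re-weights feature channels (sensor modalities)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, timesteps)
        avg = self.mlp(x.mean(dim=2))           # squeeze over time (average pooling)
        mx = self.mlp(x.amax(dim=2))            # squeeze over time (max pooling)
        scale = torch.sigmoid(avg + mx).unsqueeze(-1)
        return x * scale


class TemporalAttention(nn.Module):
    """Decides 'where' to focus: re-weights timesteps of the sequence."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pool across channels, then learn one attention weight per timestep.
        avg = x.mean(dim=1, keepdim=True)        # (batch, 1, timesteps)
        mx = x.amax(dim=1, keepdim=True)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale


class DualAttention(nn.Module):
    """Channel attention followed by temporal attention, applied sequentially."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.temporal = TemporalAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.channel(x))


if __name__ == "__main__":
    feats = torch.randn(4, 64, 128)              # (batch, feature channels, timesteps)
    print(DualAttention(64)(feats).shape)        # torch.Size([4, 64, 128])
```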

Our main contributions are three-fold. First, we present an efficient DanHAR method for multimodal HAR scenarios, which improves the representation power of CNNs by blending channel and temporal attention. Second, extensive experiments verify that the attention method achieves SOTA results with negligible parameter overhead across multiple public benchmark HAR datasets as well as the weakly labeled HAR dataset. Third, various ablation experiments validate the effectiveness and efficiency of the proposed method, and a visual analysis of attention weights in both the spatial and temporal domains is provided to improve the comprehensibility of multimodal sensor signals. In addition, the proposed method can effectively aid ground truth annotation and alleviate its burden.

The rest of the paper is organized as follows. Section 2 reviews the related literature on sensor-based HAR. Section 3 proposes the attention method for the multimodal HAR scenario. Section 4 introduces the public HAR datasets and the collected weakly labeled dataset, and details the performance comparison and analysis from several aspects. Section 5 concludes and discusses future work.

Section snippets

Related work

CNN has become the dominant technique in the computer vision field, and many researchers have continuously investigated how to improve its performance [27]. As is well known, attention plays a key role in human perception: one does not need to process a whole scene, but can instead selectively focus on its most salient parts. Recent advances in various tasks such as image identification and natural language processing (NLP) have witnessed the success of attention [12], [13], [28]. In

Model

Previous attention frameworks take no advantage of sophisticated channel-wise dependencies to mutually promote the learning of both localization and recognition. SENet [17] was the first to propose an effective mechanism for learning channel attention and achieved promising performance. Inspired by the recent success of channel attention [17], [18], [19], [20], [21], [22], in this paper we propose a novel DanHAR that simultaneously considers channel and temporal attention for multimodal time
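
To show where such an attention block can sit on the residual backbone the paper adopts, here is a minimal sketch of a residual unit whose residual branch is re-scaled by an SE-style channel gate before the skip connection. The layer configuration and the simple gate are illustrative assumptions rather than the paper's exact model.

```python
# Sketch of an attention-augmented residual unit for 1D sensor feature maps.
import torch
import torch.nn as nn


class AttentiveResidualBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
        )
        # SE-style channel gate applied to the residual branch before the skip sum.
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, timesteps)
        out = self.body(x)
        w = self.gate(out.mean(dim=2)).unsqueeze(-1)   # per-channel weights in (0, 1)
        return self.act(x + out * w)                   # attention-scaled residual + skip


if __name__ == "__main__":
    x = torch.randn(2, 64, 128)
    print(AttentiveResidualBlock(64)(x).shape)         # torch.Size([2, 64, 128])
```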

Experiments

Before using deep models, the raw sensor time series must first be segmented based on the sampling rate. The sliding window technique is widely employed for this segmentation: streams of sensor signal are split into continuous sub-sequences called windows, and each window is assigned a specific activity label. These windowing parameters have an important effect on classification performance. However, across different datasets there is still no clear consensus on the optimal
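
For reference, a minimal NumPy sketch of the sliding window segmentation described here. The window length, step size (overlap) and the majority-vote labeling rule are illustrative assumptions, since, as noted, the optimal settings vary across datasets.

```python
# Sketch of sliding window segmentation over a raw multichannel sensor stream.
import numpy as np


def sliding_windows(signal: np.ndarray, labels: np.ndarray,
                    window: int = 128, step: int = 64):
    """signal: (n_samples, n_channels); labels: (n_samples,) per-sample activity ids."""
    xs, ys = [], []
    for start in range(0, len(signal) - window + 1, step):
        seg = signal[start:start + window]
        seg_labels = labels[start:start + window]
        # Assign the window the most frequent label inside it (majority vote).
        ys.append(np.bincount(seg_labels).argmax())
        xs.append(seg)
    return np.stack(xs), np.array(ys)


if __name__ == "__main__":
    stream = np.random.randn(10_000, 6)          # e.g. 3-axis accelerometer + 3-axis gyroscope
    ann = np.random.randint(0, 5, size=10_000)   # per-sample ground-truth labels
    X, y = sliding_windows(stream, ann)
    print(X.shape, y.shape)                      # (155, 128, 6) (155,)
```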

Conclusion

Recognizing human activities from multimodal sensor data is a challenging task. In this paper, we for the first time propose a dual attention method called DanHAR, which uses channel attention and temporal attention simultaneously to better understand the decisions of deep networks for various multimodal HAR tasks. DanHAR adopts a hybrid framework combining a dual attention mechanism to fuse multimodal sensor information, which is better able to capture temporal–spatial patterns in

CRediT authorship contribution statement

Wenbin Gao: Conceptualization, Data curation, Investigation. Lei Zhang: Writing – original draft, Writing – reviewing. Qi Teng: Visualization. Jun He: Methodology. Hao Wu: Supervision, Guidance.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (41)

  • Soleimani, E., et al., Cross-subject transfer learning in human activity recognition systems using generative adversarial networks, Neurocomputing (2019)
  • He, J., et al., Weakly supervised human activity recognition from wearable sensors by recurrent attention learning, IEEE Sens. J. (2018)
  • Zagoruyko, S., et al., Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer (2016)
  • T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, C. Zhang, DiSAN: Directional self-attention network for RNN/CNN-free...
  • H. Ma, W. Li, X. Zhang, S. Gao, S. Lu, AttnSense: Multi-level attention mechanism for multimodal human activity...
  • M. Zeng, H. Gao, T. Yu, O.J. Mengshoel, H. Langseth, I. Lane, X. Liu, Understanding and improving recurrent networks...
  • Wang, K., et al., Attention-based convolutional neural network for weakly labeled human activities’ recognition with wearable sensors, IEEE Sens. J. (2019)
  • J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and...
  • Z. Gao, J. Xie, Q. Wang, P. Li, Global second-order pooling convolutional networks, in: Proceedings of the IEEE...
  • S. Woo, J. Park, J.-Y. Lee, I. So Kweon, CBAM: Convolutional block attention module, in: Proceedings of the European...

    This work was supported in part by the National Natural Science Foundation of China under Grant 61203237 and the Joint Project of Industry–University-Research of Jiangsu Province, China under Grant BY2016001-02, and in part by the Natural Science Foundation of Jiangsu Province, China under grant BK20191371.
