Applied Soft Computing

Volume 111, November 2021, 107728

DanHAR: Dual Attention Network for multimodal human activity recognition using wearable sensors

https://doi.org/10.1016/j.asoc.2021.107728

Highlights

  • We propose a dual attention method, DanHAR, to improve the comprehensibility of multimodal sensing HAR.

  • DanHAR achieves state-of-the-art performance across multiple public HAR datasets.

  • A visual analysis of attention weights in both the spatial and temporal domains is provided.

  • The attention method can effectively aid ground truth annotation of sensor signals.

Abstract

In this paper, we present a new dual attention method called DanHAR, which blends channel and temporal attention on a residual network to improve feature representation for sensor-based HAR tasks. Specifically, channel attention plays a key role in deciding what to focus on, i.e., which sensor modalities, while temporal attention tells where to focus by locating the target activity within a long sensor sequence. Extensive experiments are conducted on four public HAR datasets as well as a weakly labeled HAR dataset. The results show that the dual attention mechanism is of central importance for many activity recognition tasks. We obtain relative improvements of 2.02%, 4.20%, 1.95%, 5.22% and 5.00% over regular ConvNets on the WISDM, UNIMIB SHAR, PAMAP2 and OPPORTUNITY datasets and the weakly labeled HAR dataset, respectively. DanHAR surpasses other state-of-the-art algorithms at negligible computational overhead. A visual analysis shows that the proposed attention captures the spatial–temporal dependencies of multimodal sensing data, amplifying the more important sensor modalities and timesteps during classification. The results are in good agreement with normal human intuition.

Introduction

The last few years have seen the success of ubiquitous sensing, which aims to extract knowledge from data acquired by pervasive sensors [1]. Sensor-based human activity recognition (HAR) has become a very active research field, playing a key part in a variety of applications such as sports, interactive gaming, smart homes and general-purpose monitoring systems [2]. In essence, multimodal HAR can be treated as a multivariate time series classification problem, in which a sliding window technique divides sensor signals into multiple segments from which discriminative features are extracted [3]. Each time window can then be recognized by conventional machine learning methods such as support vector machines (SVM) [4], k-nearest neighbors (KNN) and naive Bayes. Although these shallow learning methods have made considerable achievements in inferring activity details, they depend heavily on hand-crafted feature extraction requiring expert knowledge [5], [6], which is task or application dependent and cannot be directly adopted for other similar activity recognition tasks. In addition, shallow learning struggles to capture the salient characteristics of complex activities and usually involves a laborious or time-consuming process of choosing optimal features [7]. To tackle these challenges, research into automatic feature extraction without human intervention has become an active area, and multimodal HAR is undergoing a transition from shallow learning to deep learning [8].
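
To make the conventional pipeline concrete, the following is a minimal sketch of windowed hand-crafted features fed to an SVM, using scikit-learn. The feature set, window size and the randomly generated placeholder data are illustrative assumptions, not taken from the paper.

```python
# Sketch of the 'shallow learning' pipeline: statistical features per window + SVM.
import numpy as np
from sklearn.svm import SVC


def window_features(window: np.ndarray) -> np.ndarray:
    """window: (timesteps, sensor_channels) -> flat feature vector of simple statistics."""
    return np.concatenate([
        window.mean(axis=0),   # per-channel mean
        window.std(axis=0),    # per-channel standard deviation
        window.min(axis=0),    # per-channel minimum
        window.max(axis=0),    # per-channel maximum
    ])


# Placeholder data: 200 windows of a 3-axis accelerometer, one activity label per window.
windows = [np.random.randn(128, 3) for _ in range(200)]
labels = np.random.randint(0, 4, size=200)

X = np.stack([window_features(w) for w in windows])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:5]))
```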

In recent years, deep learning has become an important research trend in the HAR area, where layers are stacked to construct deep neural networks (DNNs) that achieve clear performance gains without the need for hand-crafted features [9]. In particular, convolutional neural networks (CNNs) have significantly pushed the state of the art (SOTA) of HAR tasks due to their rich representation power. With increasing model capacity, DNNs achieve higher performance but inevitably demand more strictly labeled data. One challenge for deep HAR is therefore to collect annotated or ‘ground truth labeled’ data [10]. Ground truth annotation is tedious and laborious, since one has to label every activity instance manually by skimming through the raw sensor sequence. Moreover, time series data recorded from embedded multimodal sensors such as accelerometers or gyroscopes is far more difficult to interpret than other sensor modalities such as cameras, so accurately segmenting and labeling a target activity within a long sensor sequence requires considerable human effort. On the whole, although DNN models can automatically extract optimal features for classification, they still require strictly annotated ground truth, which makes creating a high-quality benchmark HAR dataset in a supervised learning scenario costly in human effort.

Intuitively, it is much easier for an annotator who is recording sensor data to know whether an interesting activity occurs somewhere in a long sensor sequence than to mark its exact boundaries. If activity categories can be inferred from such coarse-grained labels, and the specific location of each labeled target activity can be determined knowing only which kinds of activities the long sequence contains, the burden of manual labeling can be greatly alleviated [11]. It therefore deserves further research whether we can directly recognize and locate a target activity from coarsely labeled sensor data. We tackle the above challenges from a different angle, namely attention, which has recently been extensively investigated across research fields such as computer vision [12] and natural language processing (NLP) [13]. Similar to human perception, attention selectively focuses on parts of the target areas to enhance interesting details while suppressing irrelevant and potentially confusing information. That is to say, attention can tell where to focus by improving the representation of the regions of interest.

To the best of our knowledge, the idea of attention has seldom been adopted in the HAR scenario. Recently, two attention methods, combined with a Gated Recurrent Unit (GRU) network [14] and a Long Short-Term Memory (LSTM) network [15] respectively, were proposed to capture the dependencies of sensing signals in both the spatial and temporal domains simultaneously. Furthermore, we have proposed two attention methods, a hard attention method [10] and a soft attention method [16], which can focus on the target activity within a long sequence. However, compared with the above attention-based GRU [14] or LSTM [15], the hard and soft attention methods fail to address the spatial–temporal dependencies of multimodal sensing signals. That is to say, they can only tell where to focus from the temporal information and miss the channel information, which is crucial in deciding what to focus on. In the computer vision field, convolution operations usually extract features by blending cross-channel and spatial information together, and a series of studies have incorporated channel attention into convolution blocks [17], [18], [19], [20], [21], [22], showing great potential for performance improvement.

Channel attention, however, has never been considered in HAR. Inspired by this idea, we aim to increase representation power and comprehensibility by incorporating channel attention into multimodal HAR research. In this paper, we for the first time propose a dual attention method, called DanHAR, for the multimodal HAR scenario, which blends channel and temporal attention on a CNN model. To increase feature extraction capacity, a residual network is also introduced as the backbone. We sequentially infer channel and temporal attention maps, which enables the network to learn what and where to attend in the spatial and temporal domains simultaneously. The proposed attention captures channel features and temporal dependencies of multimodal time series sensor signals, amplifying the more important sensor modalities and timesteps during classification. Extensive experiments are conducted to evaluate DanHAR on four public HAR datasets, WISDM [23], UNIMIB SHAR [24], PAMAP2 [25] and OPPORTUNITY [26], as well as the weakly labeled dataset. We show that exploiting both attentions is superior to using either alone. A visual analysis of attention weights is further provided to explore how attention focuses on multimodal time series sensor signals and to improve the model’s comprehensibility, indicating the advantage of blending channel and temporal attention for the typical challenges of multimodal HAR.
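
As a rough illustration of the idea, the sketch below implements a CBAM-style dual attention block for 1D sensor feature maps in PyTorch: a channel attention module that re-weights feature channels (what to focus on), followed by a temporal attention module that re-weights timesteps (where to focus). The layer sizes, reduction ratio and tensor shapes are illustrative assumptions and not the authors' exact DanHAR implementation.

```python
# Sketch of sequential channel + temporal attention on 1D sensor feature maps.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Decides 'what' to focus on: re-weights feature channels (sensor modalities)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, timesteps)
        avg = self.mlp(x.mean(dim=2))           # squeeze over time (average pooling)
        mx = self.mlp(x.amax(dim=2))            # squeeze over time (max pooling)
        scale = torch.sigmoid(avg + mx).unsqueeze(-1)
        return x * scale


class TemporalAttention(nn.Module):
    """Decides 'where' to focus: re-weights timesteps of the sequence."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pool across channels, then learn one attention weight per timestep.
        avg = x.mean(dim=1, keepdim=True)        # (batch, 1, timesteps)
        mx = x.amax(dim=1, keepdim=True)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale


class DualAttention(nn.Module):
    """Channel attention followed by temporal attention, applied sequentially."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.temporal = TemporalAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.channel(x))


if __name__ == "__main__":
    feats = torch.randn(4, 64, 128)              # (batch, feature channels, timesteps)
    print(DualAttention(64)(feats).shape)        # torch.Size([4, 64, 128])
```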

Our main contributions are three-fold. First, we present an efficient DanHAR method for multimodal HAR scenarios, which improves the representation power of CNNs by blending channel and temporal attention. Second, extensive experiments verify that the attention method achieves SOTA results with negligible parameter overhead across multiple public benchmark HAR datasets as well as the weakly labeled HAR dataset. Third, various ablation experiments validate the effectiveness and efficiency of the proposed method, and a visual analysis of attention weights in both the spatial and temporal domains is provided to improve the comprehensibility of multimodal sensor signals. In addition, the proposed method can effectively aid ground truth annotation and alleviate its burden.

The rest of the paper is organized as follows. Section 2 reviews the related literature on sensor-based HAR. Section 3 proposes the attention method for the multimodal HAR scenario. Section 4 introduces the public HAR datasets and the collected weakly labeled dataset, and details the performance comparison and analysis from several aspects. Section 5 concludes and discusses future work.

Section snippets

Related work

CNN has become the dominant technique in the computer vision field, and many researchers have continuously investigated how to improve its performance [27]. As is well known, attention plays a key role in human perception: one does not need to process a whole scene, but can instead selectively focus on its most salient parts. Recent advances in various tasks such as image identification and natural language processing (NLP) have witnessed the success of attention [12], [13], [28]. In

Model

Previous attention frameworks take no advantage of sophisticated channel-wise dependencies to mutually promote the learning of both localization and recognition. SENet [17] was the first to propose an effective mechanism for learning channel attention and achieved promising performance. Inspired by the recent success of channel attention [17], [18], [19], [20], [21], [22], in this paper we propose a novel DanHAR that simultaneously considers channel and temporal attention for multimodal time
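
To show where such an attention block can sit on the residual backbone the paper adopts, here is a minimal sketch of a residual unit whose residual branch is re-scaled by an SE-style channel gate before the skip connection. The layer configuration and the simple gate are illustrative assumptions rather than the paper's exact model.

```python
# Sketch of an attention-augmented residual unit for 1D sensor feature maps.
import torch
import torch.nn as nn


class AttentiveResidualBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
        )
        # SE-style channel gate applied to the residual branch before the skip sum.
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, timesteps)
        out = self.body(x)
        w = self.gate(out.mean(dim=2)).unsqueeze(-1)   # per-channel weights in (0, 1)
        return self.act(x + out * w)                   # attention-scaled residual + skip


if __name__ == "__main__":
    x = torch.randn(2, 64, 128)
    print(AttentiveResidualBlock(64)(x).shape)         # torch.Size([2, 64, 128])
```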

Experiments

Before using deep models, the raw sensor time series must first be segmented based on the sampling rate. The sliding window technique is widely employed for this segmentation: streams of sensor signal are split into continuous sub-sequences called windows, and each window is assigned a specific activity label. These windowing parameters have an important effect on classification performance. However, across different datasets there is still no clear consensus on the optimal
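
For reference, a minimal NumPy sketch of the sliding window segmentation described here. The window length, step size (overlap) and the majority-vote labeling rule are illustrative assumptions, since, as noted, the optimal settings vary across datasets.

```python
# Sketch of sliding window segmentation over a raw multichannel sensor stream.
import numpy as np


def sliding_windows(signal: np.ndarray, labels: np.ndarray,
                    window: int = 128, step: int = 64):
    """signal: (n_samples, n_channels); labels: (n_samples,) per-sample activity ids."""
    xs, ys = [], []
    for start in range(0, len(signal) - window + 1, step):
        seg = signal[start:start + window]
        seg_labels = labels[start:start + window]
        # Assign the window the most frequent label inside it (majority vote).
        ys.append(np.bincount(seg_labels).argmax())
        xs.append(seg)
    return np.stack(xs), np.array(ys)


if __name__ == "__main__":
    stream = np.random.randn(10_000, 6)          # e.g. 3-axis accelerometer + 3-axis gyroscope
    ann = np.random.randint(0, 5, size=10_000)   # per-sample ground-truth labels
    X, y = sliding_windows(stream, ann)
    print(X.shape, y.shape)                      # (155, 128, 6) (155,)
```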

Conclusion

Recognizing human activities from multimodal sensor data is a challenging task. In this paper, we for the first time propose a dual attention method called DanHAR, which uses channel attention and temporal attention simultaneously to better understand the decisions of deep networks for various multimodal HAR tasks. DanHAR adopts a hybrid framework combining a dual attention mechanism to fuse multimodal sensor information, which is better able to capture temporal–spatial patterns in

CRediT authorship contribution statement

Wenbin Gao: Conceptualization, Data curation, Investigation. Lei Zhang: Writing – original draft, Writing – reviewing. Qi Teng: Visualization. Jun He: Methodology. Hao Wu: Supervision, Guidance.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (41)

  • Soleimani, E., et al., Cross-subject transfer learning in human activity recognition systems using generative adversarial networks, Neurocomputing (2019)
  • He, J., et al., Weakly supervised human activity recognition from wearable sensors by recurrent attention learning, IEEE Sens. J. (2018)
  • Zagoruyko, S., et al., Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer (2016)
  • T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, C. Zhang, DiSAN: Directional self-attention network for RNN/CNN-free...
  • H. Ma, W. Li, X. Zhang, S. Gao, S. Lu, AttnSense: Multi-level attention mechanism for multimodal human activity...
  • M. Zeng, H. Gao, T. Yu, O.J. Mengshoel, H. Langseth, I. Lane, X. Liu, Understanding and improving recurrent networks...
  • Wang, K., et al., Attention-based convolutional neural network for weakly labeled human activities’ recognition with wearable sensors, IEEE Sens. J. (2019)
  • J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and...
  • Z. Gao, J. Xie, Q. Wang, P. Li, Global second-order pooling convolutional networks, in: Proceedings of the IEEE...
  • S. Woo, J. Park, J.-Y. Lee, I. So Kweon, CBAM: Convolutional block attention module, in: Proceedings of the European...

    This work was supported in part by the National Natural Science Foundation of China under Grant 61203237 and the Joint Project of Industry–University-Research of Jiangsu Province, China under Grant BY2016001-02, and in part by the Natural Science Foundation of Jiangsu Province, China under grant BK20191371.
