1 Introduction

Sensor-based human activity recognition (HAR) is well known for its application in a wide range of contexts, e.g., physical activity monitoring (Wannenburg and Malekian 2017), smart home systems (Zhang et al. 2019), and assisted living for the elderly (Aramendi et al. 2018). For the realization of sensor-based HAR, supervised learning is a prominent approach because a classification model can predict well-defined human activities from the sensor data. These supervised models perform well, provided that sufficient labeled data are available. However, the performance of a HAR system is known to degrade when it faces new samples drawn from unseen distributions. For example, a HAR system used in a clinic to evaluate the effectiveness of a medication for Parkinson's disease might accurately detect the activities performed by one patient. However, it is not necessarily as accurate when recognizing a new patient’s activities, because that patient's gait or behavior differs. More samples are therefore needed to address this issue. Nonetheless, collecting more labeled samples from a new patient requires another long observation of the subject's behavior, which is time-consuming and often impractical.

To relax the requirement for a substantial amount of labeled data, several domain adaptation methods have been proposed to minimize (or eliminate) the need for such data (Long et al. 2015; Gong et al. 2012; Pan et al. 2008; Fernando et al. 2014).

Domain adaptation approaches usually try to find a subspace that minimizes domain discrepancy based on a linear projection (Fernando et al. 2014; Long et al. 2015). Nonetheless, a linear projection presumes a linear correlation among variables in order to find the principal angles that explain the data well, and such an assumption does not always hold in reality. In addition, those approaches do not explicitly align the distributions of the domains, which can lead to a misleading classifier decision boundary. Other approaches use an autoencoder to find a domain-invariant latent representation (Zhuang et al. 2018). The autoencoder is a type of feed-forward neural network that learns a compact latent representation of input samples (typically by introducing a bottleneck layer) and is trained to reconstruct the input samples. An autoencoder-based model has an advantage over the linear transformation-based approach since the autoencoder is able to learn a manifold structure that best characterizes the data. Besides, such an approach can also explicitly reduce domain discrepancy by minimizing a discrepancy measure. The approach proposed by Zhuang et al. (2018) makes use of an autoencoder architecture, minimizes domain discrepancy, and incorporates label information. The latent representations of the source and the target domains are learned at once. However, when the latent representations of both domains are learned at once, the reconstruction errors of the two domains compensate for each other. As a result, the reconstruction of both source and target domain samples remains sub-optimal. During model training, the reconstruction error is sometimes low for the source domain samples but not for the target domain samples. Consequently, the learned representation for the target domain samples is not robust enough, leading to poor classification performance.

On the one hand, optimizing the autoencoder for the target HAR domain samples first might yield the optimal latent representation for that particular domain. On the other hand, since the autoencoder is trained on the target domain first, it will see the samples from the source domain as atypical. As a result, the reconstruction error for the source domain samples will be high, especially when the samples from the source and target domains have very different characteristics, e.g., in terms of distribution. An adaptive approach to lowering the reconstruction error of the source domain samples should therefore be carefully considered.

In this work, we propose a two-phase autoencoder-based domain adaptation approach that learns a latent representation which minimizes the discrepancy between domains. In the first phase, the normality (i.e., the target HAR domain samples) is learned. Since an autoencoder lets us treat the target domain samples as normal and the source domain samples as atypical, we resort to autoencoder-based anomaly detection to model normality (Wang et al. 2018b). Learning the normal samples independently of the atypical samples in the first phase also helps make the latent representation of the target domain, which is our domain of interest, optimal and robust. This first phase is the key advantage of the two-phase approach over the existing one-phase approach: by modeling normality independently, it yields a better representation of the target domain. After modeling the normal samples, the atypical samples (i.e., source domain samples) can be identified by their high reconstruction error when they are fed into the autoencoder. Take the HAR system for Parkinson's disease patients as an example. Suppose there is a new patient. The autoencoder in the first phase aims to model the pattern of the sensor samples from this new patient (target domain) and encode it into a latent representation. Note that the modeled pattern is that of unlabeled sensor samples, which are much cheaper to collect than labeled samples. At that moment the autoencoder is, of course, biased toward this new patient, so it cannot reconstruct the sensor samples from the existing patients well, as they are still considered atypical.

In the second phase, since we know that the autoencoder can identify atypical samples, we apply sample reweighting to the source domain. The purpose of the reweighting is to lower the reconstruction error of the source HAR domain samples when they are fed into the autoencoder (hereafter, the reweighting step is referred to as source adaptation). In this way, the atypical samples will no longer be considered atypical. In the context of this paper, the source domain becomes closer to the target domain since the autoencoder can reconstruct the samples from both domains well. During the second phase, the activity labels of the source domain are simultaneously used to learn more discriminative features for classification. Furthermore, following previous work (Zhuang et al. 2018), we also minimize the Kullback-Leibler divergence (KL-divergence) between the domains in the latent representation space. KL-divergence measures how one probability distribution differs from another. Minimizing the KL-divergence between two domains brings them closer in terms of probability distribution. As a result, the learned latent representation will be domain-invariant. In the context of the Parkinson's disease example, during the second phase the sensor samples from the existing patients (source domain) are reweighted and adjusted so that the latent representations of these samples are similar to those of the new patient (in terms of probability distribution). Moreover, as the source domain has ground truth labels, the latent representations of the samples from the existing patients can now be used to construct a HAR classifier that predicts the activities of the new patient. To verify the effectiveness of our approach, we tested it in cross-domain sensor-based HAR. The contribution of this paper is twofold:

  • To propose a two-phase autoencoder-based domain adaptation approach that is able to adapt an existing source HAR domain based on the learned latent representation in a target HAR domain by means of learnable source sample weight.

  • To evaluate the proposed approach in sensor-based HAR that depicts a challenging real-life scenario where there is no available labeled data (i.e., only unlabeled data are available) in the domain of interest (i.e., target domain), but there is a possibility to make use of any labeled data in the existing domain (i.e., source domain).

The organization of this paper is as follows. Section 2 reviews existing works on HAR and transfer learning. Section 3 elaborates our proposed approach to domain adaptation for HAR. Section 4 shows how our approach is validated using real HAR datasets and discusses the findings. Finally, Section 5 concludes our work.

2 Related Works

This section briefly reviews existing literature in the area of transfer learning and HAR.

2.1 Human Activity Recognition

Human activity recognition forms the basis for various wellness-related applications (Bagaveyev and Cook 2014; Stiefmeier et al. 2008; Meditskos et al. 2018; Kim et al. 2018). Due to the advancement of pervasive sensing devices and their ability to avoid the privacy issues that other sensing modalities entail (Ponce et al. 2016), sensor-based HAR has been gaining popularity. Many machine learning-based works have been proposed to infer activities from sensor data, e.g., accelerometer, gyroscope, magnetometer, and RFID data. Wannenburg and Malekian (2017) proposed a system for physical activity monitoring using smartphone accelerometer data. Jahn et al. (2017) used raw sensor data from mobile devices to detect the mode of transportation, e.g., car, tram, and bus. In the field of assisted daily living for older adults, Aramendi et al. (2018) developed a system for the automatic assessment of functional health decline using passive infra-red (PIR) presence sensors. However, wearable human activity recognition usually requires large amounts of labeled data, and collecting such labeled data is time-consuming and tedious. The data from an existing domain are potentially useful. Nonetheless, traditional supervised approaches are known to degrade in a new environment or domain, where the activity patterns or users (e.g., age, gender, behavior, posture) are different.

2.2 Transfer Learning

Transfer learning approaches have been proposed to reduce the difference between domains. There are two families of transfer learning approaches, namely instance-based and feature-based transfer. Instance-based transfer learning aims to learn weights in the source domain, by which the re-weighted source domain samples get closer to the target domain samples (DAI 2007; Al-Stouhi and Reddy 2011; Yao and Doretto 2010). The feature-based approach aims to learn a new representation in which the distance between domains is reduced (Gong et al. 2012; Pan et al. 2011; Long et al. 2015; Zhuang et al. 2018; Blitzer et al. 2006; Chen et al. 2012). However, most of these approaches find the new representation by relying on a linear transformation. As a result, such approaches might fail to model data with a more complicated structure. Furthermore, only very few of those approaches explicitly minimize domain discrepancy, e.g., Transfer Learning with Double encoding-layer Autoencoders (TLDA) (Zhuang et al. 2018) and maximum mean discrepancy embedding (MMDE) (Pan et al. 2008).

The TLDA approach by Zhuang et al. (2018) uses an autoencoder to learn a manifold structure that best characterizes the data and explicitly reduces domain discrepancy by minimizing a statistical distance. It is a one-phase approach that learns the latent representation for both domains at once. Our approach is a feature-based approach built on an autoencoder model. Unlike the previous autoencoder-based approach, we consider a two-phase approach. First, we learn the latent representation for the domain of interest (i.e., the target domain) independently to obtain a better representation of the target domain. Subsequently, the source domain is adapted to bring it closer to the target domain while incorporating the available label information. In the field of sensor-based HAR itself, Wang et al. (2018a) used feature-based approaches to handle domain difference. Yet, their method still relies on approaches that either depend on linearity or neglect explicit minimization of domain discrepancy (i.e., the distribution difference between domains). Furthermore, it involves a voting mechanism that combines several classifiers, so there are many hyperparameters to tune. Our approach explicitly minimizes domain discrepancy and can model data with a complicated structure since it accommodates nonlinearity. Besides, our approach needs only one hyperparameter, i.e., the size of the latent representation.

3 Method

This section explains our proposed domain adaptation approach, the atypical sample regularizer autoencoder (Asura). The overall framework is shown in Fig. 1. As shown in the figure, in the first phase an autoencoder is trained such that the encoding layer models a robust representation of the normal samples (i.e., the target domain) by minimizing their reconstruction error. In the second phase, the weight matrix is learned to further minimize the reconstruction error of the source domain samples. At the same time, the softmax loss and the KL-divergence between domains are minimized. The softmax loss is minimized to learn the classification function, while the KL-divergence is minimized to reduce the domain discrepancy between the source and the target HAR domains.

Fig. 1 Our proposed approach, represented as a two-phase approach

3.1 Problem Definition

Let \( {n}_{\mathsf{s}} \) denote the number of samples of sensor data from a source HAR domain, \( {\mathbf{X}}^{\mathsf{s}}=\left\{{\mathbf{x}}_i^{\mathsf{s}}\mid i=1,\dots, {n}_{\mathsf{s}}\right\} \), and \( {n}_{\mathsf{t}} \) be the number of samples of sensor data from a target HAR domain, \( {\mathbf{X}}^{\mathsf{t}}=\left\{{\mathbf{x}}_j^{\mathsf{t}}\mid j=1,\dots, {n}_{\mathsf{t}}\right\} \), where \( {\mathbf{x}}_i^{\mathsf{s}},{\mathbf{x}}_j^{\mathsf{t}}\in {\mathbb{R}}^d \). The label information in the source domain is available, i.e., \( {\mathbf{y}}^{\mathsf{s}}=\left\{{y}_i^{\mathsf{s}}\mid i=1,\dots, {n}_{\mathsf{s}}\right\} \), and \( {\mathbf{x}}_i^{\mathsf{s}} \) corresponds to \( {y}_i^{\mathsf{s}} \). However, the label information in the target domain, i.e., \( {\mathbf{y}}^{\mathsf{t}}=\left\{{y}_j^{\mathsf{t}}\mid j=1,\dots, {n}_{\mathsf{t}}\right\} \), is unavailable. Assuming that the labels in the source and target domains share the same label space, i.e., \( {y}_i^{\mathsf{s}},{y}_j^{\mathsf{t}}\in \mathcal{Y} \), our objective is to accurately predict \( {\mathbf{y}}^{\mathsf{t}} \) given \( {\mathbf{X}}^{\mathsf{s}} \), \( {\mathbf{y}}^{\mathsf{s}} \), and \( {\mathbf{X}}^{\mathsf{t}} \).

3.2 Phase 1: Modeling the Target Domain

In the first phase, a simple autoencoder is trained to model the target domain by learning its latent representation. The optimum encoding and decoding matrix are obtained by solving the following problem:

$$ \hat{\mathbf{W}},{\hat{\mathbf{W}}}^{\prime },\hat{\mathbf{b}},{\hat{\mathbf{b}}}^{\prime }=\underset{\mathbf{W},{\mathbf{W}}^{\prime },\mathbf{b},{\mathbf{b}}^{\prime }}{\arg \min}\sum \limits_{i=1}^{n_{\mathsf{t}}}\parallel \sigma \left({\mathbf{W}}^{\prime}\sigma \left(\mathbf{W}{\mathbf{x}}_i^{\mathrm{t}}+\mathbf{b}\right)+{\mathbf{b}}^{\prime}\right)-{\mathbf{x}}_i^{\mathrm{t}}{\parallel}^2 $$
(1)

where \( \mathbf{W}\in {\mathbb{R}}^{h\times d} \) and \( {\mathbf{W}}^{\prime}\in {\mathbb{R}}^{d\times h} \) are the encoding and decoding weights, respectively. The variable h indicates the size of the latent representation. The terms \( \mathbf{b} \) and \( {\mathbf{b}}^{\prime } \) are the biases of the encoder and decoder, respectively. The nonlinear function is denoted as σ. In this paper, we use the sigmoid function for the nonlinearity, i.e.,

$$ \sigma \left(\mathbf{x}\right)=\frac{1}{1+{e}^{-\mathbf{x}}}. $$
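The objective in Eq. 1 can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the toy dimensions (d = 4, h = 2), and the random initial weights are assumptions made for illustration only, and the sketch evaluates the loss without performing any training. Because the decoder ends in a sigmoid, the sketch also assumes the input features are scaled to [0, 1].

```python
import math
import random

def sigmoid(v):
    """Element-wise logistic function applied to a list of floats."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def affine(M, x, b):
    """Compute M x + b for a matrix M (list of rows), vector x, and bias b."""
    return [sum(m * xi for m, xi in zip(row, x)) + bi for row, bi in zip(M, b)]

def reconstruct(x, W, b, W_dec, b_dec):
    """Encode x into h dimensions, then decode back to d dimensions."""
    z = sigmoid(affine(W, x, b))             # latent code, length h
    return sigmoid(affine(W_dec, z, b_dec))  # reconstruction, length d

def reconstruction_loss(X, W, b, W_dec, b_dec):
    """Sum of squared reconstruction errors over all samples (cf. Eq. 1)."""
    total = 0.0
    for x in X:
        x_hat = reconstruct(x, W, b, W_dec, b_dec)
        total += sum((a - c) ** 2 for a, c in zip(x_hat, x))
    return total

# Hypothetical toy setup: d = 4 input features, h = 2 latent units.
random.seed(0)
d, h = 4, 2
W = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(h)]
W_dec = [[random.uniform(-0.5, 0.5) for _ in range(h)] for _ in range(d)]
b, b_dec = [0.0] * h, [0.0] * d
X_target = [[random.random() for _ in range(d)] for _ in range(5)]
loss = reconstruction_loss(X_target, W, b, W_dec, b_dec)
```

In practice the four parameters would be updated by gradient descent until this loss stops decreasing; the sketch only shows what quantity is being minimized.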

3.3 Phase 2: Source Samples Adaptation

3.3.1 Weight Matrix Learning for Source Samples Reweighting

If two domains are different, an autoencoder trained on samples from one domain will yield a high reconstruction error for samples from the other domain. We propose to learn a weight matrix Ψ such that, when we apply Ψ to the source samples \( {\mathbf{X}}^{\mathsf{s}} \) and subsequently apply the nonlinearity, the reconstruction error is reduced. Let \( {\overset{\sim }{\mathbf{x}}}_i^{\mathsf{s}}=\sigma \left(\Psi {\mathbf{x}}_i^{\mathsf{s}}+{\mathbf{b}}_{\mathsf{s}}\right) \) denote a single reweighted sample from the source domain, where \( {\mathbf{b}}_{\mathsf{s}} \) is the bias term for the reweighting. The learning of the optimal Ψ is formulated as the following minimization problem:

$$ {\displaystyle \begin{array}{c}\hat{\Psi},{\hat{\mathbf{b}}}_{\mathsf{s}}=\underset{\Psi, {\mathbf{b}}_{\mathsf{s}}}{\arg \min}\ {\mathcal{L}}_1,\\ {}{\mathcal{L}}_1=\sum \limits_{i=1}^{n_{\mathsf{s}}}{\left\Vert \sigma \left({\hat{\mathbf{W}}}^{\prime}\sigma \left(\hat{\mathbf{W}}{\overset{\sim }{\mathbf{x}}}_i^{\mathsf{s}}+\hat{\mathbf{b}}\right)+{\hat{\mathbf{b}}}^{\prime}\right)-{\mathbf{x}}_i^{\mathsf{s}}\right\Vert}^2.\end{array}} $$
(2)

Note that the problem in Eq. 2 is essentially a continuation of the autoencoder training in which the model parameters learned by solving the problem in Eq. 1, i.e., \( \hat{\mathbf{W}} \), \( {\hat{\mathbf{W}}}^{\prime } \), \( \hat{\mathbf{b}} \), and \( {\hat{\mathbf{b}}}^{\prime } \), are held fixed. With these variables fixed, the reconstruction error for the target domain samples remains unchanged and small, while the reconstruction error for the source domain samples is made smaller through the weight matrix Ψ. Once the weights are optimized, the latent representations of the source and target domains can be computed as follows:

$$ {\displaystyle \begin{array}{c}{\mathbf{Z}}^{\mathsf{s}}=\left\{{\mathbf{z}}_i^{\mathrm{s}}|i=1,\dots, {n}_{\mathsf{s}}\right\},\\ {}{\mathbf{Z}}^{\mathrm{t}}=\left\{{\mathbf{z}}_i^{\mathrm{t}}|i=1,\dots, {n}_{\mathsf{t}}\right\},\end{array}} $$
(3)

where \( {\mathbf{z}}_i^{\mathsf{s}}=\sigma \left(\hat{\mathbf{W}}{\overset{\sim }{\mathbf{x}}}_i^{\mathsf{s}}+\hat{\mathbf{b}}\right) \), \( {\overset{\sim }{\mathbf{x}}}_i^{\mathsf{s}}=\sigma \left(\hat{\Psi}{\mathbf{x}}_i^{\mathsf{s}}+{\hat{\mathbf{b}}}_{\mathsf{s}}\right) \), and \( {\mathbf{z}}_i^{\mathsf{t}}=\sigma \left(\hat{\mathbf{W}}{\mathbf{x}}_i^{\mathsf{t}}+\hat{\mathbf{b}}\right) \). Note that \( {\mathbf{z}}^{\mathsf{s}}\in {\mathbb{R}}^h \), \( {\mathbf{z}}^{\mathsf{t}}\in {\mathbb{R}}^h \), and \( {\overset{\sim }{\mathbf{x}}}^{\mathsf{s}}\in {\mathbb{R}}^d \).
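As a hedged illustration of Eqs. 2 and 3, the sketch below computes the reweighted source sample, the source reconstruction loss, and the latent codes of both domains in plain Python. All function and variable names are hypothetical, and the sketch assumes Ψ is a d × d matrix (which follows from the reweighted sample staying in \( {\mathbb{R}}^d \)). It evaluates the quantities only; the gradient-based update of Ψ and the reweighting bias is omitted.

```python
import math

def sigmoid(v):
    """Element-wise logistic function applied to a list of floats."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def affine(M, x, b):
    """Compute M x + b for a matrix M (list of rows), vector x, and bias b."""
    return [sum(m * xi for m, xi in zip(row, x)) + bi for row, bi in zip(M, b)]

def reweight(x_s, Psi, b_s):
    """Reweighted source sample sigma(Psi x + b_s); the result keeps length d."""
    return sigmoid(affine(Psi, x_s, b_s))

def latent_source(x_s, Psi, b_s, W, b):
    """Latent code of a source sample: encode the reweighted sample."""
    return sigmoid(affine(W, reweight(x_s, Psi, b_s), b))

def latent_target(x_t, W, b):
    """Latent code of a target sample: encode it directly."""
    return sigmoid(affine(W, x_t, b))

def source_loss(X_s, Psi, b_s, W, b, W_dec, b_dec):
    """Source reconstruction loss (cf. Eq. 2).

    Only Psi and b_s would be trained here; W, b, W_dec, and b_dec are the
    phase-1 parameters and are treated as constants.
    """
    total = 0.0
    for x in X_s:
        z = latent_source(x, Psi, b_s, W, b)
        x_hat = sigmoid(affine(W_dec, z, b_dec))
        total += sum((a - c) ** 2 for a, c in zip(x_hat, x))
    return total
```

Keeping the phase-1 parameters out of the update is what leaves the target reconstruction untouched while the source samples are pulled toward it.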

3.3.2 Softmax Weight Learning for Classification

Since the true labels in the source HAR domain are available, we can use them to learn the best features for classification. A fully connected layer is added after the source domain latent representation. Precisely, the output represents a softmax regression, i.e., the generalization of logistic regression to multi-class classification. The softmax regression weights, \( \Theta =\left\{{\boldsymbol{\theta}}_1,\dots, {\boldsymbol{\theta}}_c\right\} \), \( {\boldsymbol{\theta}}_i\in {\mathbb{R}}^h \), can be optimized by solving the following problem:

$$ {\displaystyle \begin{array}{c}\hat{\Theta}=\underset{{\boldsymbol{\theta}}_1,\dots, {\boldsymbol{\theta}}_c}{\arg \min}\ {\mathcal{L}}_2,\\ {}{\mathcal{L}}_2=-\frac{1}{n_{\mathsf{s}}}{\sum}_{i=1}^{n_{\mathsf{s}}}{\sum}_{j=1}^c\mathbf{1}\left\{{y}_i^{\mathsf{s}}=j\right\}\log \frac{e^{{\boldsymbol{\theta}}_j^{\top }{\mathbf{z}}_i^{\mathsf{s}}}}{\sum_{k=1}^c{e}^{{\boldsymbol{\theta}}_k^{\top }{\mathbf{z}}_i^{\mathsf{s}}}},\end{array}} $$
(4)

where 1{·} is an indicator function, whose value is 1 if the expression is true, otherwise 0, and \( {y}_i^{\mathsf{s}}\in \left\{1,2,\dots, c\right\} \) is the ground truth that represents one of c activities. After obtaining \( \hat{\Theta} \), the prediction of the activity class, i.e., y, on an instance by the softmax classifier can be obtained as follows:

$$ y=\underset{j}{\arg \max}\frac{e^{{\hat{\boldsymbol{\theta}}}_j^{\top}\sigma \left(\hat{\mathbf{W}}\mathbf{x}+\hat{\mathbf{b}}\right)}}{\sum_{k=1}^c{e}^{{\hat{\boldsymbol{\theta}}}_k^{\top}\sigma \left(\hat{\mathbf{W}}\mathbf{x}+\hat{\mathbf{b}}\right)}}, $$
(5)

where \( \hat{\boldsymbol{\theta}}\in \hat{\Theta} \).
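The softmax loss of Eq. 4 and the class prediction of Eq. 5 can be sketched in plain Python as follows. The names are hypothetical, labels are assumed to be integers in 1..c as in the text, and the prediction returns the index of the class with the largest softmax probability. Subtracting the maximum score before exponentiating is a common numerical-stability measure added here; it does not change the probabilities.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # shift by the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(z, Theta):
    """Class probabilities for latent code z; Theta holds c vectors of length h."""
    return softmax([sum(t * zi for t, zi in zip(theta, z)) for theta in Theta])

def softmax_loss(Z, y, Theta):
    """Mean negative log-likelihood of the true classes (cf. Eq. 4)."""
    nll = sum(-math.log(class_probs(z, Theta)[yi - 1]) for z, yi in zip(Z, y))
    return nll / len(Z)

def predict(z, Theta):
    """Return the label in 1..c with the largest probability (cf. Eq. 5)."""
    probs = class_probs(z, Theta)
    best = max(range(len(probs)), key=lambda j: probs[j])
    return best + 1
```

For a target sample, the latent code passed to `predict` would be the plain encoding of that sample, since no reweighting is applied to the target domain.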

3.3.3 KL-Divergence Minimization for Domain Discrepancy Reduction

To enforce 1) the weighted source domain samples to have a distribution similar to that of the target domain samples; and 2) the latent representation of the source domain samples to have a distribution similar to that of the latent representation of the target domain samples, we minimize the KL-divergence for both aspects. Let the KL-divergence be defined by Equation 6, where \( {P}_A \) and \( {P}_B \) are probability distributions.

$$ {D}_{KL}\left({P}_A\parallel {P}_B\ \right)=\sum \limits_i{P_A}_i\ln\ \left(\frac{{P_A}_i}{{P_B}_i}\right). $$
(6)

It is worth noting that KL-divergence is asymmetric, which means that in general \( {D}_{KL}\left({P}_A\parallel {P}_B\right)\ne {D}_{KL}\left({P}_B\parallel {P}_A\right) \). To obtain a symmetric version, we take the sum of \( {D}_{KL}\left({P}_A\parallel {P}_B\right) \) and \( {D}_{KL}\left({P}_B\parallel {P}_A\right) \) for each pair of probability distributions \( {P}_A \) and \( {P}_B \). Hence, the discrepancy between samples coming from two different HAR domains can be calculated as follows:

$$ {\mathcal{L}}_3=\left[{D}_{KL}\left({P}_S\parallel {P}_T\ \right)+{D}_{KL}\left({P}_T\parallel {P}_S\ \right)\right]+\left[{D}_{KL}\left({P}_s\parallel {P}_t\ \right)+{D}_{KL}\left({P}_t\parallel {P}_s\ \right)\right] $$
(7)

where

$$ {\displaystyle \begin{array}{cc}{P}_S=\frac{P_S^{\prime }}{\sum {P}_S^{\prime }},& {P}_S^{\prime }=\frac{1}{n_{\mathsf{s}}}\sum \limits_{i=1}^{n_{\mathsf{s}}}{\overset{\sim }{\mathbf{x}}}_i^{\mathsf{s}},\\ {}{P}_T=\frac{P_T^{\prime }}{\sum {P}_T^{\prime }},& {P}_T^{\prime }=\frac{1}{n_{\mathsf{t}}}\sum \limits_{i=1}^{n_{\mathsf{t}}}{\mathbf{x}}_i^{\mathsf{t}},\end{array}} $$
(8)

and

$$ {\displaystyle \begin{array}{cc}{P}_s=\frac{P_s^{\prime }}{\sum {P}_s^{\prime }},& {P}_s^{\prime }=\frac{1}{n_{\mathsf{s}}}\sum \limits_{i=1}^{n_{\mathsf{s}}}{\mathbf{z}}_i^{\mathsf{s}},\\ {}{P}_t=\frac{P_t^{\prime }}{\sum {P}_t^{\prime }},& {P}_t^{\prime }=\frac{1}{n_{\mathsf{t}}}\sum \limits_{i=1}^{n_{\mathsf{t}}}{\mathbf{z}}_i^{\mathsf{t}}.\end{array}} $$
(9)

The term \( \left[{D}_{KL}\left({P}_S\parallel {P}_T\right)+{D}_{KL}\left({P}_T\parallel {P}_S\right)\right] \) characterizes the discrepancy between domains in the input space (i.e., between the weighted source domain samples and the target domain samples). Meanwhile, the term \( \left[{D}_{KL}\left({P}_s\parallel {P}_t\right)+{D}_{KL}\left({P}_t\parallel {P}_s\right)\right] \) characterizes the discrepancy between domains in the latent representation space (i.e., between the latent representation of the source domain samples and that of the target domain samples).
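The discrepancy term of Eqs. 7 to 9 can be sketched in plain Python as below. The sketch uses the standard discrete KL-divergence, i.e., the sum over i of p_i ln(p_i / q_i), and assumes all entries are non-negative, which holds for sigmoid outputs and for inputs scaled to [0, 1]. The function names and the small constant `eps` that guards against taking the logarithm of zero are assumptions of this sketch, not part of the original formulation.

```python
import math

def mean_distribution(samples):
    """Average samples feature-wise, then normalize to sum to 1 (cf. Eqs. 8-9)."""
    n = len(samples)
    mean = [sum(col) / n for col in zip(*samples)]
    total = sum(mean)
    return [m / total for m in mean]

def kl(p, q, eps=1e-12):
    """Standard discrete KL-divergence; eps is an assumed guard against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """Symmetrized divergence: D(p || q) + D(q || p)."""
    return kl(p, q) + kl(q, p)

def discrepancy_loss(X_s_tilde, X_t, Z_s, Z_t):
    """Input-space plus latent-space symmetric KL terms (cf. Eq. 7)."""
    P_S, P_T = mean_distribution(X_s_tilde), mean_distribution(X_t)
    P_s, P_t = mean_distribution(Z_s), mean_distribution(Z_t)
    return symmetric_kl(P_S, P_T) + symmetric_kl(P_s, P_t)
```

Note that each domain is summarized by a single normalized mean vector, so the term compares average activation profiles rather than full sample distributions.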

3.4 The Overall Optimization Problem

Our approach is a two-phase approach. The first phase is the modeling of the target HAR domain samples, represented as the learning of a robust latent representation of the target HAR domain samples, i.e., Equation 1. The second phase is the source adaptation, which covers the learning of the source weight matrix, the minimization of the classification loss, and the minimization of the domain discrepancy, summarized as the minimization of the following objective:

$$ {\mathcal{L}}_{\mathsf{adapt}}={\mathcal{L}}_1+{\mathcal{L}}_2+{\mathcal{L}}_3 $$
(10)

All the minimization problems can be solved by gradient descent optimization. The gradient of the objective function with respect to all model parameters can be computed efficiently using available automatic differentiation software packages. Algorithm 1 summarizes the procedure of our approach.

Algorithm 1

4 Experiment and Discussion

In this section, we verify the proposed approach. First, it is tested with various latent space dimension sizes to find the size that maximizes the overall accuracy. Then, we compare our best-performing model to the baselines. We also conducted a statistical test to assess the significance of the improvement achieved by our approach.

4.1 Dataset

The datasets used in this study are the OPPORTUNITY dataset (OPP) (Chavarriaga et al. 2013), the Physical Activity Monitoring dataset (PAMAP) (Reiss and Stricker 2012), and the daily and sports activities (DSA) dataset (Barshan and Yüksek 2014). All datasets contain motion sensor data on several subjects’ activities, captured using triaxial accelerometer, gyroscope, and magnetometer sensors. The sensors are distributed over the subjects’ bodies. Specifically, each sensor is attached to a body part and generates 3-channel sensor data (i.e., x, y, and z axes), from which 27 features are derived. The datasets, as well as the feature extraction procedure, follow the work of Wang et al. (2018a). The preprocessed versions of the datasets are publicly available. We prepare 11 domain adaptation tasks, each of which resembles a single cross-domain HAR task, as shown in Table 1. Tasks 1–4 represent the scenario of transfer within the same dataset between slightly different body parts, e.g., from the left hand to the right hand. Tasks 5–8 represent the scenario where the transfer occurs within the same dataset but between very different body parts, e.g., from the hand to the chest. The cross-dataset scenario with similar body parts (e.g., transfer from the chest in the PAMAP dataset to the back in the OPP dataset) is represented by tasks 9–11.

Table 1 Domain adaptation tasks

4.2 Implementation and Baselines

We use the PyTorch Python library to implement the proposed architecture. The ADAM optimizer (Kingma and Ba 2014) is used to solve the optimization problem, with its parameters left at their defaults, since the optimizer-specific parameters are beyond the scope of this paper. The experiments are repeated for h ∈ {10, 20, 30, …, 100}. We compare our approach with several baselines commonly used in the domain adaptation literature: principal components analysis (PCA), kernel PCA (KPCA), transfer component analysis (TCA) (Pan et al. 2008), geodesic flow kernel (GFK) (Gong et al. 2012), transfer kernel learning (TKL) (Long et al. 2015), stratified transfer learning (STL) (Wang et al. 2018a), and TLDA (Zhuang et al. 2018). Most of these baselines are widely used as benchmarks in the domain adaptation literature (e.g., Wang et al. 2018a; Sun et al. 2016, 2017; Fernando et al. 2014). The TLDA approach is based on an autoencoder and is thus the baseline most closely related to our approach. For each baseline, the parameters are varied according to its original paper, and the best result is reported. Classification accuracy is used to measure model performance, defined as

$$ Accuracy=\frac{TP+ TN}{TP+ FP+ TN+ FN}, $$
(11)

where TP, TN, FP, and FN denote “true positive”, “true negative”, “false positive”, and “false negative”, respectively.
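For a multi-class problem such as activity recognition, accuracy is commonly computed as the fraction of samples whose predicted label matches the ground truth. The short sketch below assumes that reading of Eq. 11; the function name and the example labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels for four samples; three of the four predictions match.
example = accuracy([1, 2, 3, 1], [1, 2, 1, 1])
```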

4.3 Parameter Sensitivity Analysis

Our approach depends on one hyperparameter, h, which represents the size of the latent space dimension. The classification results on the datasets for varying h are shown in Table 2. Interestingly, with only h = 10, our model achieves the highest overall result. However, a larger h can sometimes produce a better latent representation that gives higher accuracies for the more difficult tasks, e.g., task 2 and task 10.

Table 2 Classification accuracy of our approach by varying the dimension of the latent representation

4.4 Comparison with Existing Approaches

Table 3 shows the classification results. Asura with h = 10 is compared with the existing approaches. The results for PCA, KPCA, TCA, GFK, TKL, and STL are taken from Wang et al. (2018a). Our approach has the best accuracy for most of the tasks. The most competitive existing approach is STL, which outperforms our approach in task 2 and slightly outperforms it in task 10. In addition, Asura achieves a better result than the other autoencoder-based approach, i.e., TLDA.

Table 3 Comparison of the classification accuracy with the baselines

We also conducted a two-sample t-test over all tasks to measure the significance of the accuracy of our approach compared to the existing ones. Specifically, for each baseline, all of its accuracies over the entire set of tasks are compared to those of Asura to obtain the p-value. We set the significance level α to 0.05 to determine whether the accuracy difference is significant (i.e., when the resulting p-value < α). The result, shown in Table 4, suggests that our approach significantly outperforms the existing approaches: apart from the most competitive approach (STL), the p-values are on the order of \( {10}^{-3} \) to \( {10}^{-4} \), indicating a considerable improvement by our approach.

Table 4 The result of the t-test, where the p-value describes the significance of the improvement of our approach over the corresponding method
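The comparison described above can be sketched in plain Python. The text does not state which variant of the two-sample t-test was used, so the sketch assumes Welch's unequal-variance version and computes only the test statistic and the degrees of freedom. The p-value would then be read from the t-distribution, for example with `scipy.stats.ttest_ind(a, b, equal_var=False)` if SciPy is available. The accuracy values below are hypothetical placeholders, not results from this paper.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical per-task accuracies for two methods (illustration only).
ours = [0.80, 0.75, 0.82, 0.78]
baseline = [0.60, 0.58, 0.65, 0.61]
t_stat, dof = welch_t(ours, baseline)
```

A positive `t_stat` here means the first method has the higher mean accuracy; whether the difference is significant depends on the p-value obtained for that statistic and `dof`.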

4.5 Discussion

This section discusses the effect of the latent representation size, the quality of the representation learned by our approach compared to the existing one, and the limitations of this work.

Smaller latent space dimensions generally work well and outperform the baselines, although some tasks, e.g., tasks 2 and 10, require a higher latent space dimension to encode features for better prediction results. One insight is that the tasks with lower accuracies are 1) the tasks that involve the DSA dataset; and 2) the tasks where the DSA dataset is the source. This indicates that DSA is difficult to transfer with a small latent representation, so a higher latent space dimension is needed for such tasks. It is also worth noting that if the source and target datasets differ only in the position on the body, e.g., left hand and right hand, the performance tends to be higher, as shown in the results for tasks 1–4 in Tables 2 and 3. This is because the variance of activities performed by the left hand is also explainable by the right hand. On the other hand, when the difference between domains gets larger, as in tasks 5–8 (the same dataset with a different body part and a different position), the accuracy gets lower. Tasks 9–11 (different datasets) are even more challenging for all approaches.

Our learning scheme, i.e., the two-phase approach, is able to produce a better domain-invariant latent representation. This is indicated by the softmax classifier, whose weights were learned using source domain samples: it generalizes to the target domain samples better than the one-phase autoencoder-based approach, i.e., TLDA. Specifically, the two-phase approach in Asura outperforms its one-phase counterpart in the more difficult tasks, i.e., tasks 9–11.

Although the overall result shows the superiority of Asura, a limitation remains: the approach depends on the choice of the latent representation dimension h. It is difficult to find a one-value-fits-all latent representation dimension, since the end result depends on the characteristics of the dataset itself. However, this limitation, i.e., the requirement to set the size of the domain-invariant feature space as a hyperparameter, also exists in the other approaches. Besides, compared to the most competitive approach, i.e., STL, Asura has fewer hyperparameters to tune. Note that STL uses a voting mechanism over several base classifiers to determine initial pseudo-labels on the target domain. The number of classifiers and the choice of each classifier can themselves be considered hyperparameters. Asura also has fewer hyperparameters than the other autoencoder-based approach, i.e., TLDA. TLDA needs to weight each term in its objective function to control the priority of the classification loss, the reconstruction error, and the discrepancy measure. In contrast, Asura omits such weighting to further prune the hyperparameter search space and reduce the need for hyperparameter tuning through model validation. In transfer learning scenarios, labeled data are expensive, so obtaining a validation set for tuning hyperparameters is hardly an option. Hence, having too many hyperparameters is unfavorable.

We can also notice that the tasks where the source and target come from the same dataset yield relatively high accuracies, since the difference between the source and target domains is only related to the sensor positioning. In contrast, the performance of our approach drops when the source and target domains come from different datasets. The cross-dataset problem is indeed more difficult, since not only the subjects but also the sensor configurations (sampling rate, sensitivity, displacement, etc.) are different. As a result, the domain discrepancy gets larger, and the two domains become more difficult to align. Thus, it is challenging to achieve very high accuracies in such cases. In a real situation, this can occur, for example, when a HAR system in a clinic is to be replicated in the patient’s home to perform Parkinson's disease assessment remotely. The environment setup in the patient’s home, including the set of sensors used by the patient, might differ from that in the clinic, which causes an even larger domain discrepancy.

5 Conclusion

We have proposed a two-phase autoencoder-based domain adaptation approach that is able to use the existing source HAR domain based on the latent representation learned in the target HAR domain. Inspired by autoencoder-based anomaly detection, the approach learns normality and subsequently regularizes the atypical samples in the existing source HAR domain to bring them closer to the target HAR domain by means of source adaptation. Through the source adaptation, the atypical samples become normal, indicating that only a small difference between domains remains. We tested Asura in a cross-domain sensor-based HAR scenario that reflects the scarcity of labeled data in the domain of interest while it is possible to make use of labeled data in the existing domain. The experimental result shows that our model significantly outperforms the existing approaches. However, Asura depends on the choice of the latent representation size, which can be considered a limitation of this work. We consider the automatic selection of the latent representation size as future work. It would also be interesting to examine the applicability of our approach to applications other than HAR. The poor performance caused by an overly large domain discrepancy (as shown in the cross-dataset task results) might be remedied in the future by introducing new labeled samples in the target domain. For instance, after the initial model deployment using our approach, an online learning scheme could be incorporated. The model could then be refined incrementally using new samples (e.g., obtained from the patient in a non-obtrusive manner) that arrive over time, so that it gradually adapts better to the target domain and yields better accuracy.