Introduction

The exponential growth of online information has driven the development and application of recommendation systems (RS). Such systems are instrumental in mitigating information overload and find applications across platforms like Amazon, Taobao, Jingdong, Facebook, Microblog, etc. [1]. RS proactively suggests information or products aligned with user preferences by analyzing their past interaction records. Broadly, the traditional techniques employed in RS can be grouped into two main categories: Content-based methods and Collaborative Filtering (CF) methods [2]. The former recommends items by drawing parallels between them and the user’s historical preferences. In contrast, CF techniques primarily rely on user or item similarity metrics to formulate recommendations. Both families are rooted in machine learning [3].

Collaborative Filtering (CF) methods enjoy widespread adoption due to their straightforward implementation and robust efficiency. A significant proportion of these CF approaches are anchored in matrix factorization (MF). However, most existing MF models struggle to learn nonlinear features, and the sparsity problem remains a major bottleneck for MF. Recently, deep learning methods have shown great power in learning latent feature representations and have found a wide range of applications in various fields [4,5,6,7]. Inspired by deep learning, many works combine it with MF and propose deep matrix factorization models to address the aforementioned shortcomings, with commendable outcomes. Autoencoders stand out as a favored approach in this domain [8,9,10]. For instance, Sedhain et al. employed an autoencoder to learn feature representations of either items or users to implement collaborative filtering [8]. Another popular method is the Multi-Layer Perceptron (MLP) [11, 12]. Fan et al. leveraged multi-layer neural networks to learn the latent representations of users or items, approximating a nonlinear latent variable model [11]. This model is versatile, applicable to various matrix completion tasks, and has achieved satisfactory results.

Methods that introduce auxiliary information about users and/or items have demonstrated significant prowess in addressing the sparsity challenge [13]. Zhang and Zou et al. introduced additional item/user attributes into the collaborative filtering model to improve the accuracy of rating prediction [14, 15]. However, owing to the high cost of obtaining labeled knowledge and the increasing awareness of user privacy protection, it is difficult to capture auxiliary attribute information about items or users. He and Man et al. take advantage of rating information from closely correlated domains using transfer learning methodologies for recommendation [16, 17]. Yet, transfer learning often falls short when the structure of the rating data differs substantially across domains. Recently, autoencoder-based methods have been widely used in recommendation systems thanks to their freedom from label requirements and their fast convergence. Some works employ a semi-autoencoder to co-embed the item’s attribute information and the graph features of the items for rating prediction, improving prediction results [9, 10, 18,19,20]. Nonetheless, these models neglect the exploration of latent feature representations, which curtails their optimization efficiency. Moreover, many existing recommendation models lack versatility and efficiency when expanding the feature space with user or item auxiliary information.

To address the challenges of the various models mentioned above, this paper proposes a novel model, called deep matrix factorization via feature subspace transfer (FSTDMF). Specifically, this paper introduces the semi-autoencoder [14] model to initialize the item features from the user-item rating matrix and integrates it into the FSTDMF model to facilitate advanced latent feature representation learning. Simultaneously, this work incorporates a subspace projection distance regularization term to transfer the latent feature representations of the item’s auxiliary information to the target task. Remarkably, FSTDMF is adept at harnessing auxiliary information from diverse sources, encompassing attribute data, rating information from other domains, etc. The key contributions of this paper are summarized as follows:

  • To address the low generality and low efficiency of existing approaches in transferring the auxiliary information of users or items to expand the feature space, this paper takes the subspace projection distance between the latent features of the item auxiliary information and the latent features of the items as a penalty term to realize feature transfer.

  • To overcome the limitation that simple feature initialization methods place on model capability, the proposed model uses a semi-autoencoder to initialize the latent feature representation and integrates it into the FSTDMF model to obtain better latent feature representations and improve recommendation performance.

  • Experimental results on several public real-world datasets demonstrate that the proposed method captures more powerful feature representations than state-of-the-art methods.

The remainder of this paper is organized as follows. In “Related work” section, the related work is briefly reviewed, and then the FSTDMF model is proposed in “FSTDMF model” section. In “Experiments” section, the experimental setup and the experimental results on four real-world datasets are provided in detail. In “Conclusion” section, the proposed model is summarized and future work is introduced.

Related work

Recommendation systems aim to gauge users’ preferences for items and suggest those they might appreciate [21]. In the initial stages of recommendation system research, collaborative filtering emerged as the most popular and broadly utilized method [22]. Previous collaborative filtering methods can be divided into two categories: memory-based and model-based methods [23]. Memory-based methods are further subdivided into user-based and item-based methods; both leverage rating data to determine the similarity between users or items and thus share similar principles. The other category is model-based methods, whose representative technique is matrix factorization (MF) [24]. MF decomposes the rating matrix into a user matrix and an item matrix and then makes predictions based on the two decomposed matrices. Singular value decomposition (SVD) stands out as a renowned matrix factorization model [24]. It decomposes the rating matrix into the inner product of two low-dimensional feature matrices and then makes predictions based on the reduced-dimensional matrices. An enhancement of SVD, known as SVD++, integrates implicit feedback [24]. While collaborative filtering offers effective recommendations in data-rich scenarios, its efficacy wanes in data-sparse situations.

The current methods for alleviating data sparsity include deep matrix factorization (DMF) methods [25] and methods that integrate auxiliary information pertaining to users and items [26]. At the core of DMF methodologies lies the adoption of deep learning to construct a nonlinear model, transcending the confines of traditional linear MF paradigms. Coinciding with the rapid advancements in deep learning, many deep architecture-based models have found their application in recommender systems, including but not limited to multi-layer perceptrons (MLP) [27, 28], convolutional neural networks (CNN) [29, 30], recurrent neural networks (RNN) [31, 32], adversarial networks (AN) [33], and autoencoders (AE) [34,35,36,37], among other evolving architectures.

Another way to mitigate data sparsity involves integrating auxiliary information related to users and/or items. Strub et al. proposed a collaborative filtering network model that incorporates additional information into an autoencoder for rating matrix completion [38]. To utilize the additional information flexibly, Zhang et al. modified the autoencoder structure and unveiled a semi-autoencoder model tailored for both rating prediction and personalized top-n recommendations [14].

In addition, cross-domain recommendation presents another avenue for the inclusion of auxiliary user and item information [39]. Pan et al. utilized the user embeddings learned in the source domain to initialize the user embeddings in the target domain and constrained the two to remain close [40]. Man et al. proposed an Embedding and Mapping model (EMCDR) for cross-domain recommendation to use auxiliary information [17]. The introduction of external additional information is also an important means of addressing data sparsity.

Despite the contributions of the aforementioned methods, they exhibit shortcomings in their initialization processes, which curbs their model performance. Furthermore, these techniques often lack versatility and efficiency when introducing auxiliary information. For instance, works [14, 38] are tailored specifically to item side information, while works [17, 39, 40] emphasize item rating data from other domains, suggesting a lack of versatility in their approaches. In contrast, the approach taken here diverges significantly. This work employs a semi-autoencoder model to initialize the latent feature representation of items derived from the user-item rating matrix, integrating this into the model to facilitate advanced latent feature learning. Additionally, inspired by the literature [41], this study found that by utilizing subspace projection distances, diverse auxiliary information can be incorporated seamlessly, achieving enhanced outcomes through merely elementary learning from this auxiliary information. Specifically, this study transforms item attribute data, or denser item rating data from other domains, into a low-rank latent feature representation. This representation is then matched, via the subspace projection distance, against the target task’s latent feature representation, enabling the target task to effectively learn from the auxiliary information.

FSTDMF model

Problem setting

Suppose there is a partially observed rating matrix \(R \in R^{m \times n}\) representing ratings from m users for n items. Additionally, there is a binary matrix \(\psi \in R^{m \times n}\), where the entry \(\psi _{ij}\) equals 1 if \(R_{ij}\) is observed and 0 otherwise. Auxiliary information about the items is also available; it might derive from item attributes, item ratings in other relevant domains, and so forth. Given this item auxiliary information, the aim is to ascertain its latent feature representation \(V_A \in R^{n \times d}\), which behaves as a partial orthonormal matrix, i.e., \(V_A^T V_A=E\). This paper presents two methodologies to derive \(V_A\); the specific procedures are detailed in “Learning latent representations from item auxiliary information” section. Then, given \(R\) and \(V_A\), the objective is to secure an effective latent feature representation \(V \in R^{n \times r}\) that allows the missing entries in R to be reconstructed, subsequently enabling the prediction of all user ratings for every item.
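To make the setting concrete, the following minimal NumPy sketch builds the observation mask \(\psi \) from a toy rating matrix; encoding unobserved entries as zeros is an assumption of the sketch, not a requirement of the model:

```python
import numpy as np

# Toy rating matrix R (m users x n items); 0 marks an unobserved entry
# (this zero-encoding is an illustrative convention of the sketch).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)

psi = (R != 0).astype(float)      # binary mask: psi_ij = 1 iff R_ij is observed
m, n = R.shape
print(int(psi.sum()), "observed entries out of", m * n)
```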

FSTDMF framework

The proposed FSTDMF framework is depicted in Fig. 1. First, the model takes the rating matrix as the input of a semi-autoencoder to initialize the latent feature representation V of the items. In addition, this study employs two distinct methods (one-hot encoding and a semi-autoencoder model) to extract the latent feature representation \(V_A\) from the item’s auxiliary information A (comprising both item attribute details and item rating matrices from other domains). Subsequently, the initial feature matrix V of the items obtained by the semi-autoencoder serves as the input of the DMF model, and the regularization term (10) is added to the objective function of the model to transfer the features from \(V_A\) to V, in pursuit of a superior V, which in turn facilitates improved user-item rating predictions. In the following subsections, each component will be discussed in detail.

Fig. 1 The framework of the FSTDMF model

DMF model combined with a semi-autoencoder

Matrix completion is a matrix factorization method. The low-rank assumption of conventional matrix completion methods indicates that R arises from a linear latent variable model, i.e.,

$$\begin{aligned} R=U V^T \end{aligned}$$
(1)

where \(U \in R^{m\times r}\) denotes the latent feature representation of users. The target function can be expressed as:

$$\begin{aligned} \min _{U,V} \left\| \psi \odot \left( {R}-{UV^T}\right) \right\| _{{F}}^2+ \lambda \Vert {V}\Vert _{{F}}^2+ \beta \Vert {U}\Vert _{{F}}^2 \end{aligned}$$
(2)

In Eq. (2), \(\lambda \) and \(\beta \) are regularization parameters and \(\odot \) denotes the Hadamard product. By solving Eq. (2), the missing entries can be recovered.
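As a minimal sketch of Eq. (2), the following NumPy function evaluates the masked objective for given factors; the toy data and random initialization are illustrative only:

```python
import numpy as np

def mf_objective(R, psi, U, V, lam, beta):
    """Masked MF loss of Eq. (2): only observed entries enter the data term."""
    residual = psi * (R - U @ V.T)                 # Hadamard product with the mask
    return (np.linalg.norm(residual, 'fro') ** 2
            + lam * np.linalg.norm(V, 'fro') ** 2
            + beta * np.linalg.norm(U, 'fro') ** 2)

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
R = rng.integers(1, 6, size=(m, n)).astype(float)
psi = (rng.random((m, n)) < 0.6).astype(float)     # ~60% of entries observed
U, V = rng.normal(size=(m, r)), rng.normal(size=(n, r))
print(mf_objective(R, psi, U, V, lam=0.1, beta=0.1))
```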

Fan et al. [11] proposed to use multi-layer neural networks to learn the latent feature representations of users or items to approximate the nonlinear latent variable model instead of Eq. (2):

$$\begin{aligned}&\min _{V, W, b} \frac{1}{2 n} \sum _{i=1}^n\left\| \psi _i \odot \left( r_i-g\left( W \hat{v}_i+b\right) \right) \right\| ^2 \nonumber \\ {}&\quad +\frac{1}{2 {n}} \lambda \Vert V\Vert _F^2+\frac{1}{2} \beta \Vert W \Vert _F^2 \end{aligned}$$
(3)

where \( W \in R ^{m \times r} \) denotes the weight matrix, \( b \in R^m\) denotes the bias vector, \(\psi _i\) is the ith column of \(\psi \), \(\hat{v}_i\) is the ith column of \(V^T\), \(r_i\) is the ith column of R, and \(g(\cdot )\) denotes an activation function such as the sigmoid or hyperbolic tangent function.

It is well known that multi-layer neural networks are often more effective than single-layer neural networks in approximating nonlinear functions. Therefore, the final target function (3) can be further approximated as:

$$\begin{aligned}&\min _{V, W, b} \frac{1}{2 n} \sum _{i=1}^n \left\| \psi _i \odot \left( r_{i}-g^{(K+1)}\left( g^{(K)}\left( \ldots g^{(1)}\left( \hat{v}_i, \Theta ^{(1)}\right) \ldots , \Theta ^{(K)}\right) , \Theta ^{(K+1)}\right) \right) \right\| ^2 \nonumber \\&\quad +\frac{1}{2 n} \lambda \Vert V\Vert _F^2 +\frac{1}{2} \beta \sum _{j=1}^{K+1}\Vert W^{(j)} \Vert _F^2 \end{aligned}$$
(4)

where \(g^{(j)}(t)=g(W^{(j)}t+b^{(j)})\), \(\Theta ^{(j)}=\left\{ W^{(j)},b^{(j)}\right\} \), \(j = 1, 2, \ldots , K + 1\), and K is the number of hidden layers. By solving Eq. (4), V can be obtained and then the missing entries of R can be recovered.
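The nested decoder of Eq. (4) amounts to an ordinary feed-forward pass. The sketch below implements it, assuming the same activation g at every layer (the paper allows sigmoid or tanh); shapes and weights are illustrative:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def dmf_forward(v_hat, params, g=sigmoid):
    """Decoder of Eq. (4): r_hat = g^{(K+1)}(... g^{(1)}(v_hat, Theta^{(1)}) ...).
    params is a list of K+1 pairs (W^{(j)}, b^{(j)})."""
    t = v_hat
    for W, b in params:
        t = g(W @ t + b)                      # g^{(j)}(t) = g(W^{(j)} t + b^{(j)})
    return t

rng = np.random.default_rng(0)
r, hidden, m = 8, 32, 50                      # feature dim, hidden width, #users
params = [(rng.normal(scale=0.1, size=(hidden, r)), np.zeros(hidden)),
          (rng.normal(scale=0.1, size=(m, hidden)), np.zeros(m))]
v_hat = rng.normal(size=r)                    # one column of V^T
print(dmf_forward(v_hat, params).shape)       # (50,): predicted ratings for item i
```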

However, the latent feature representation V is initialized from a Gaussian distribution, which can restrict the model’s learning capability and make it challenging to achieve a better latent feature representation. To solve this problem, this work combines the semi-autoencoder with DMF to improve the model’s learning capability.

Fig. 2 The framework of the semi-autoencoder model

The semi-autoencoder is an unsupervised model that attempts to learn a compact representation of the input in the output layer [14]. It is generally composed of a three-layer network and does not require the dimensions of the input layer and the output layer to be equal, as shown in Fig. 2. Given the input \(\left\{ x_1,x_2,x_3,\ldots x_n\right\} \), where \(x_i \in R^m\), the semi-autoencoder tries to learn a function \(f_{Q,p}(x)\approx x\). The encoder and decoder of a semi-autoencoder are expressed as follows:

$$\begin{aligned} \xi _i&=f(Q x_i+p) \end{aligned}$$
(5)
$$\begin{aligned} \tilde{x}_i&={g}\left( \tilde{Q} \xi _i+\tilde{p}\right) \end{aligned}$$
(6)

where \(Q\in R^{r\times m}\) and \(\tilde{Q} \in R^{m\times r}\) are the weight matrices, \(p\in R^{r}\) and \(\tilde{p} \in R^{m}\) are the bias vectors, \(\xi _i\in R^{r}\) is the output of the encoder, \(\tilde{x}_i\) is the reconstruction of \(x_i\), and f and g are nonlinear activation functions. The aim of the semi-autoencoder model is to minimize the reconstruction error by learning the parameters \(Q,\ p,\tilde{Q}\), and \(\tilde{p}\) as in Eq. (7):

$$\begin{aligned}&\min _{Q, \tilde{Q}, p, \tilde{p}} \sum _{i=1}^n\left\| x_i-\tilde{x_i}\right\| ^2=\min _{Q, \tilde{Q}, p, \tilde{p}} \nonumber \\&\qquad \sum _{i=1}^n\left\| x_i-g\left( \tilde{Q}\left( f\left( Q x_i+p\right) \right) +\tilde{p}\right) \right\| ^2 \end{aligned}$$
(7)
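The sketch below is a direct NumPy transcription of Eqs. (5)-(7) with untrained, randomly drawn parameters; in practice \(Q, p, \tilde{Q}, \tilde{p}\) would be learned by minimizing the reconstruction error, e.g., with gradient descent:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
m, n, r = 100, 40, 16                 # input dim, number of inputs, code dim
X = rng.random((m, n))                # columns x_i are the inputs

# Encoder/decoder parameters of Eqs. (5)-(6); random values stand in for
# what minimizing Eq. (7) would learn.
Q,  p  = rng.normal(scale=0.1, size=(r, m)), np.zeros((r, 1))
Qt, pt = rng.normal(scale=0.1, size=(m, r)), np.zeros((m, 1))

Xi      = sigmoid(Q @ X + p)          # Eq. (5): codes xi_i for all inputs at once
X_tilde = sigmoid(Qt @ Xi + pt)       # Eq. (6): reconstructions
loss    = np.sum((X - X_tilde) ** 2)  # Eq. (7): total reconstruction error
print(Xi.shape, round(float(loss), 2))
```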

Semi-autoencoders have found extensive use in recommendation systems with promising results [25]. For instance, Dong et al. [9] employ a semi-autoencoder to co-embed the attributes and the graph features of the items for rating prediction. Geng et al. [18] extract auxiliary information from DBpedia, leverage the LSI model to learn hidden relations on top of item features, and subsequently combine them with the original rating matrix and side information, which are fed into a semi-autoencoder for recommendation prediction.

In this paper, the semi-autoencoder is solely employed to derive the latent feature representation V and not for output prediction. The sigmoid function is applied as the encoding activation function. This process can be expressed as:

$$\begin{aligned} \hat{v}_i=f\left( Q r_{i}+p\right) \end{aligned}$$
(8)

By integrating this into the DMF model, a better initialization of the latent feature representation V is achieved. Finally, the prediction results are obtained by solving Eq. (4).

However, solely relying on rating data limits the model’s predictive efficacy, particularly when the rating matrix is sparse. To mitigate this issue, the introduction of a subspace projection distance has been proposed to incorporate auxiliary information, thereby enhancing predictive accuracy. The details will be described in “DMF model via feature subspace transfer” section.

DMF model via feature subspace transfer

Assuming the latent feature representation \(V_A\) has been obtained from the auxiliary information, its features can be transferred to the target domain. In general, a regularization term \(Sim(V, V_A)\), which measures the similarity between V and \(V_A\), can be added to the optimization model to transfer the auxiliary features \(V_A\).

This paper uses the subspace projection distance to measure the similarity \(Sim(V, V_A)\). Denote the ith column of V by \(v_i\). The projection distance from \(v_i\) to the subspace spanned by \(V_A\) can be defined by [41]:

$$\begin{aligned} {\text {dist}}\left( v_{i}, V_{{A}}\right) =\left\| v_{i}-V_A V_A^{{T}} v_{i}\right\| _{{F}} \end{aligned}$$
(9)

where \(V_A\) is a partial orthonormal matrix. Hence, the similarity Sim\((V,V_A)\) can be defined by:

$$\begin{aligned} \textrm{Sim}\left( V, V_{A}\right)&=\sum _{i=1}^{n} \textrm{dist}^2\left( v_{i}, V_{A}\right) =\sum _{i=1}^{n}\left\| v_{i}-V_{A} V_{A}^{T} v_{i}\right\| _{F}^2 \nonumber \\&=\left\| V-V_{A} V_{A}^{T} V\right\| _{F}^2= \Vert V\Vert _{F}^2-\Vert V_{A} V_{A}^{T} V \Vert _{F}^2 \end{aligned}$$
(10)
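The last equality in Eq. (10) holds because \(V_A V_A^T\) is an orthogonal projector whenever \(V_A^T V_A = E\). The following sketch checks the identity numerically, generating a partial orthonormal \(V_A\) via QR decomposition (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, d = 200, 16, 10
V = rng.normal(size=(n, r))
V_A, _ = np.linalg.qr(rng.normal(size=(n, d)))   # partial orthonormal: V_A^T V_A = E

proj = V_A @ (V_A.T @ V)                          # projection of V onto span(V_A)
sim1 = np.linalg.norm(V - proj, 'fro') ** 2       # summed squared dist(v_i, V_A)
sim2 = (np.linalg.norm(V, 'fro') ** 2
        - np.linalg.norm(proj, 'fro') ** 2)       # right-hand side of Eq. (10)
print(np.allclose(sim1, sim2))                    # True: the two forms agree
```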

Adding the similarity \(\textrm{Sim}\left( V, {~V}_{{A}}\right) \) as a penalty term to the model (4), the target function can be formed as follows:

$$\begin{aligned}&\min _{V, W, b} \frac{1}{2 n} \sum _{i=1}^n \left\| \psi _i \odot \left( r_{i}-g^{(K+1)}\left( g^{(K)}\left( \ldots g^{(1)}\left( \hat{v}_i, \Theta ^{(1)}\right) \ldots , \Theta ^{(K)}\right) , \Theta ^{(K+1)}\right) \right) \right\| ^2+\frac{1}{2 n} \lambda \Vert V \Vert _{F}^2 \nonumber \\&\quad +\frac{1}{2} \beta \sum _{j=1}^{K+1} \Vert W^{(j)} \Vert _{F}^2 +\frac{1}{2 n} \mu \Vert V\Vert _{F}^{2}-\frac{1}{2 n} \mu \Vert V_{A} V_{A}^{T} V \Vert _{F}^{2} \end{aligned}$$
(11)

Here, \(\mu \) is a regularization parameter. The model (11) can be further simplified as:

$$\begin{aligned}&\min _{V, W, b} \frac{1}{2 n} \sum _{i=1}^n \left\| \psi _i \odot \left( r_{i}-g^{(K+1)}\left( g^{(K)}\left( \ldots g^{(1)}\left( \hat{v}_i, \Theta ^{(1)}\right) \ldots , \Theta ^{(K)}\right) , \Theta ^{(K+1)}\right) \right) \right\| ^2+\frac{1}{2 n} \alpha \Vert V \Vert _{F}^2 \nonumber \\&\quad +\frac{1}{2} \beta \sum _{j=1}^{K+1} \Vert W^{(j)} \Vert _{F}^2 -\frac{1}{2 n} \alpha \eta \Vert V_{A} V_{A}^{T} V \Vert _{F}^{2} \end{aligned}$$
(12)

with \( \alpha =\lambda +\mu \) and \( \eta =\mu /\alpha \). The confidence parameter \(\eta \in [0,1]\) represents the relatedness between the auxiliary information and the target data. A large \(\eta \) means that the auxiliary and target domains share a similar feature structure. If \(\eta =0\), the model reduces to Eq. (4), meaning that the auxiliary data cannot improve the prediction performance on the target domain.
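As a minimal sketch, the regularization terms that distinguish Eq. (12) from plain DMF can be written as follows; the parameter names mirror the paper’s \(\alpha \) and \(\eta \):

```python
import numpy as np

def fst_penalty(V, V_A, alpha, eta, n):
    """Regularization terms of Eq. (12): a ridge penalty on V minus the
    transfer reward that pulls V toward the subspace spanned by V_A.
    eta = 0 recovers Eq. (4); eta = 1 fully trusts the auxiliary subspace."""
    ridge    = alpha / (2 * n) * np.linalg.norm(V, 'fro') ** 2
    transfer = alpha * eta / (2 * n) * np.linalg.norm(V_A @ (V_A.T @ V), 'fro') ** 2
    return ridge - transfer
```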

The optimization problem of FSTDMF in Eq. (12) is nonconvex. Fan et al. [11] propose to find local minima by nonlinear optimization techniques such as BFGS [42] and the improved resilient backpropagation (iRprop+) algorithm [43]. In [44], it was shown that iRprop+ often outperforms other methods in optimizing neural networks, and iRprop+ is more efficient than BFGS for large-scale problems. This paper follows Fan et al. [11] in using BFGS to solve DMF when the size of R is relatively small (e.g., m, n < 1000) and iRprop+ otherwise.

Learning latent representations from item auxiliary information

As previously mentioned, the proposed model processes the provided item auxiliary information A to obtain its latent feature representation \(V_A\). For this purpose, two distinct methodologies are employed.

Items’ auxiliary information typically falls into two categories: item side information and the rating matrices that grade items in various domains. While the item side information encompasses a wide range, including attributes, images, comments, videos, etc., obtaining certain types like images, videos, and comments can be challenging. However, attribute information and rating matrices from different domains are more readily accessible. This work leverages both the item attribute information and the domain-specific rating matrices to extract the latent feature representations.

More precisely, each item comes with its attribute data, such as the year of production and genres. Given that there are multiple categories within each attribute, this work adopts one-hot encoding to handle the attribute data of items. For instance, in the Movielens dataset, each span of 7 years is treated as a distinct code for the attribute ’year’, while every genre is a separate code for the attribute ’genres’. After encoding, the data is represented as \(C \in R^{n\times d}\). Since C is not orthogonal, the next step is to orthogonalize it. Subsequently, the latent feature representation \(V_A\) of the items’ auxiliary information is obtained. The process is illustrated in Fig. 3, and a sketch of the pipeline is given below.
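The following sketch traces this pipeline on hypothetical attribute data (the year and genre values below are made up, and QR is an illustrative orthogonalization; the paper does not pin down the method):

```python
import numpy as np

# Hypothetical attribute data for n = 8 items: release year and a binary genre
# code (real Movielens items carry many more genres; this is only a sketch).
years  = np.array([1995, 1996, 2003, 2010, 1988, 1995, 2001, 1999])
genres = np.array([0, 1, 1, 0, 1, 0, 0, 1])

def one_hot(codes, width):
    out = np.zeros((codes.size, width))
    out[np.arange(codes.size), codes] = 1.0
    return out

year_bin = (years - years.min()) // 7                     # one code per 7-year span
C = np.hstack([one_hot(year_bin, int(year_bin.max()) + 1),
               one_hot(genres, int(genres.max()) + 1)])   # C in R^{n x d}

# C is generally not column-orthonormal, so it is orthogonalized.
V_A, _ = np.linalg.qr(C)
print(np.allclose(V_A.T @ V_A, np.eye(V_A.shape[1])))     # V_A^T V_A = E
```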

In contrast to deriving latent features from item attributes, when it comes to the rating matrices that grade items in other domains, a feature learning model is used. This approach is reminiscent of cross-domain recommendation methods [39]. Dong et al. [9] employ a semi-autoencoder to co-embed item attributes with their graph features for more accurate rating predictions. It concatenates the item ratings R, item attributes A, and item graph structure G, feeding them into the semi-autoencoder to learn the hidden representations of the items:

$$\begin{aligned} \epsilon =f(Q \textrm{con}(R,A,G)+p) \end{aligned}$$
(13)

Here, \(\epsilon \in R^{n \times d}\) represents the latent feature representations to be learned. In this context, since A and G are redundant, Eq. (13) simplifies to Eq. (14):

$$\begin{aligned} \epsilon =f(Q R+p) \end{aligned}$$
(14)

This bears similarity to a standard semi-autoencoder, so, by solving Eq. (7), the latent features are obtained from the rating matrices that indicate how items are graded in external domains. After orthonormalization, the latent feature representation \(V_A\) is derived. The process is depicted in Fig. 4, and a sketch follows below.
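A minimal sketch of this second pipeline, with untrained encoder weights standing in for the result of minimizing Eq. (7), and QR used for the final orthonormalization (again an illustrative choice):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
m_src, n, d = 300, 120, 20
R_src = rng.random((m_src, n))       # denser rating matrix from another domain

# Encoder of Eq. (14); random weights stand in for training on Eq. (7).
Q, p = rng.normal(scale=0.1, size=(d, m_src)), np.zeros((d, 1))
eps = sigmoid(Q @ R_src + p).T       # epsilon in R^{n x d}: item latent features

# Orthonormalize so that V_A^T V_A = E, as the problem setting requires.
V_A, _ = np.linalg.qr(eps)
print(V_A.shape, np.allclose(V_A.T @ V_A, np.eye(d)))
```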

Fig. 3 The process of learning latent feature representations from item attribute information

Fig. 4 The process of learning latent feature representations from item rating matrices in other domains

Experiments

Datasets

This work conducts experiments on four real-world datasets to assess the effectiveness of the proposed FSTDMF model.

Movielens100K: Recognized as a standard benchmark dataset, Movielens100K is frequently utilized to gauge the accuracy of rating prediction algorithms. This dataset comprises 100,000 ratings (ranging from 1 to 5) provided by 943 users for 1682 items, with both users and items having at least 20 interactions. Additionally, it includes attributes for users and items, with item attributes covering genre and release year.

Movielens1M: Serving as an expanded version of the Movielens100K, the Movielens1M dataset is also a popular choice for assessing recommendation systems. It contains 1,000,209 ratings (from 1 to 5) given by 6040 users across 3952 items.

DoubanMovie [45]: This dataset revolves around well-structured movie ratings on Douban. It comprises 1,287,869 ratings (from 1 to 5) offered by 2712 users for 34,893 items, where both users and items have a minimum of 5 interactions.

MovieLens20M [46]: Part of the MovieLens family, the MovieLens20M dataset has 1,462,905 ratings (from 0 to 5) provided by 10,000 users for 9395 items. Both users and items in this dataset have at least 5 interactions.

In the case where only the items’ attribute information is available, this study conducts experiments on the Movielens100k and Movielens1M data sets to evaluate the performance of the proposed FSTDMF.

Conversely, in scenarios exclusively involving the item rating matrix from other domains, this work conducts experiments on the Movielens100k, Movielens1M, DoubanMovie, and Movielens20M data sets. This paper designates two distinct tasks for these datasets:

Task 1: Utilizing MovieLens1M as the source domain and MovieLens100K as the target domain.

Task 2: Employing DoubanMovie as the source and MovieLens20M as the target domain.

For both tasks, this work performs latent representation learning of auxiliary information only for items that exist in both domains simultaneously. The details of the two tasks are summarized in Table 1.

Table 1 Statistics of two tasks in the experiment of transferring the rating matrix of items in other areas

Compared methods

This paper compares the proposed FSTDMF with the following baseline methods:

  • PMF [47]: PMF models the latent attributes of users and items using a probabilistic linear model with Gaussian observation noise. It is derived from a probabilistic version of SVD factorization.

  • DMF [11]: DMF is on the basis of a nonlinear latent variable model. It is formulated as a deep-structure neural network, in which the inputs are the low-dimensional unknown latent variables and the outputs are the partially observed variables.

  • NCF [27]: NCF leverages only implicit feedback. It begins with randomly initialized user and item representations and subsequently employs a multi-layer perceptron to learn the user-item interaction function.

  • I-AUTO [8]: The Item-based AutoRec (I-AUTO) model leverages an autoencoder to discern effective item feature representations for recommendation.

  • GraphRec [10]: GraphRec co-embeds users and items in a unified latent space. It harnesses the user-item interaction graph’s Laplacian to glean graph-based attribute features. Remarkably, it relies solely on the rating matrix, and all attribute information is extracted from the structure of the graph.

  • HCRSA [14]: HCRSA breaks the restriction that the dimensions of the autoencoder’s input and output layers be equal and introduces auxiliary information for representation learning.

  • MFSAE [18]: MFSAE extracts auxiliary information from DBpedia and leverages the LSI model to learn hidden relations on top of item features. Finally, combined with the original rating matrix and side information, are fed into a semi-autoencoder for recommendation prediction.

  • PRKG [19]: PRKG draws side information from DBpedia and encodes it into a low-dimensional representation using an autoencoder. It then introduces a semi-autoencoder to blend this side information into the recommendation process.

  • Item-Agrec [9]: Item-Agrec uses a semi-autoencoder to jointly embed the attributes and graph features of the items for rating prediction.

  • CMF [48]: CMF offers a joint matrix factorization technique, where entities’ latent factors are cohesively shared across both source and target domains.

  • EMCDR [17]: EMCDR adopts Matrix Factorization (MF) to learn embeddings first and then utilizes a network to bridge the user embeddings from the auxiliary domain to the target domain.

  • PTUPCDR [49]: PTUPCDR learns a meta-network fed with users’ characteristic embeddings to generate personalized bridge functions, achieving personalized transfer of preferences for each user.

Here, PMF is an ordinary matrix factorization method and uses only the rating matrix. DMF, NCF, and I-AUTO are deep matrix factorization methods that also use only the rating matrix. GraphRec, HCRSA, MFSAE, PRKG, and Item-Agrec are deep matrix factorization methods that use both the rating matrix and item attribute information; for GraphRec, this paper uses GraphRec1 to denote GraphRec with item attribute information and GraphRec2 to denote GraphRec with item attribute information and the graph. CMF, EMCDR, and PTUPCDR are cross-domain recommendation methods.

Table 2 Parameter settings for each algorithm

Parameter settings and evaluation metrics

In the experiments, this study sets the same parameter values for the compared methods as in the cited papers when the data sets match those used here. If a compared method was not evaluated on the data sets used in this paper, the parameter values yielding optimal results are selected through repeated experiments; the same procedure is used for the proposed model. In addition, the autoencoder includes an input layer and a hidden layer, while the DMF model includes an input layer, a hidden layer, and an output layer. Since the output of the autoencoder’s hidden layer serves as the input to the DMF model, the output dimension of the autoencoder’s hidden layer matches the input dimension of the DMF model. The dimension of the DMF model’s hidden layer is selected in the range between the item feature dimension and the number of users. The parameter values for the compared methods and the proposed method are presented in Table 2.

This paper uses Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), which are commonly used in recommendation, to evaluate the experimental performance of the models mentioned. Their expressions are as follows:

$$\begin{aligned} \textrm{MAE}=\frac{\sum _{(i, j) \in \omega _{\text{ test } }}\left| R_{i j}-\tilde{R}_{i j}\right| }{\left| \omega _{\text{ test } }\right| } \end{aligned}$$
(15)
$$\begin{aligned} \textrm{RMSE}=\sqrt{\frac{\sum _{(i, j)\in \omega _{\text{ test } }}\left( R_{i j}-\tilde{R}_{i j}\right) ^2}{\left| \omega _{\text{ test } }\right| }} \end{aligned}$$
(16)

Here \(\omega _\textrm{test}\) denotes the indices of the test entries and \(\tilde{R}_{i j}\) denotes the predicted rating. Obviously, the more accurate the prediction, the lower the values of MAE and RMSE.
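A direct NumPy transcription of Eqs. (15) and (16); the toy data below is illustrative:

```python
import numpy as np

def mae_rmse(R, R_pred, test_idx):
    """MAE and RMSE of Eqs. (15)-(16), evaluated over the test indices omega_test."""
    rows, cols = test_idx
    err = R[rows, cols] - R_pred[rows, cols]
    return np.mean(np.abs(err)), np.sqrt(np.mean(err ** 2))

R      = np.array([[5.0, 3.0], [4.0, 1.0]])
R_pred = np.array([[4.6, 3.3], [3.9, 1.8]])
test_idx = (np.array([0, 1]), np.array([1, 1]))   # omega_test = {(0,1), (1,1)}
print(mae_rmse(R, R_pred, test_idx))              # (0.55, ~0.604)
```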

Experimental results

This work conducted experiments on four datasets: Movielens100k, Movielens1M, Movielens1M-Movielens100K, and DoubanMovie-Movielens20M. The aim was to highlight the enhancements of the proposed model when introducing both item attribute information and the item’s rating matrix from other domains. For the rating matrix in other domains, this work allocated 90% for training and the remainder for testing to obtain \(V_A\). For the target data set, this work experimented with 70%, 50%, and 30% as training data, using the remainder for testing. The average results of all models over five repeated experiments on the data sets are reported in Tables 3, 4, 5, and 6.

From the results, it’s evident that PMF consistently lags behind other methods across all datasets. This performance gap can be attributed to the datasets’ non-linear structure, which contrasts with PMF’s linear modeling approach. In contrast, deep matrix factorization, being non-linear, is aptly suited for handling such data.

As can be seen in Table 3, GraphRec, HCRSA, MFSAE, PRKG, and FSTDMF outperform DMF, NCF, and I-AUTO on the Movielens100K data set. This demonstrates the usefulness of transferring knowledge from the item attribute information. In addition, GraphRec2 achieves the best result except for FSTDMF, which shows the effectiveness of using the Laplacian of the user-item interaction graph to gather graph-based attribute features. Notice that when the training set is 30%, I-AUTO falls behind DMF and NCF. This is because I-AUTO is based on the autoencoder model, whose performance is limited when dealing with a sparse training set. Compared to DMF, the MAE and RMSE results of FSTDMF improved by 3.8% and 3.2%, 4.2% and 3.7%, and 3% and 2.8% in the different training/test partition cases, respectively, and FSTDMF achieved the best results in all cases. Notably, FSTDMF achieves larger improvements than GraphRec2 when the training set is 50% and 30%. This is because, when the training set is sparser, FSTDMF can still effectively utilize the item auxiliary information.

Table 3 Performance of various algorithms on the Movielens100k data set

In Table 4, it can be seen that for the Movielens1M data set, GraphRec, HCRSA, MFSAE, and FSTDMF outperform DMF, NCF, and I-AUTO. This indicates that incorporating attribute information can still enhance recommendation performance on a large and sparse data set. In addition, GraphRec still achieves the best results with the exception of FSTDMF. This demonstrates the effectiveness of leveraging the semi-autoencoder to combine item auxiliary information. However, it is worth noting that the performance of GraphRec2 is similar to that of GraphRec1. This suggests that utilizing the Laplacian of the user-item interaction graph to gather graph-based attribute features does not significantly improve the model’s performance on this data set. Notice that FSTDMF achieves the best result in all cases. Moreover, compared to DMF, the MAE and RMSE results of FSTDMF improved by 3.3% and 3.2% when the training set is 30%. This indicates that FSTDMF can still effectively utilize the item auxiliary information when the data set is large and sparse.

Table 4 Performance of various algorithms on the Movielens1M data set

In Table 5, it is evident that CMF, EMCDR, PTUPCDR, and FSTDMF clearly outperform DMF, PMF, and NCF on the Movielens1M-Movielens100K data set. These results demonstrate the usefulness of transferring knowledge from the rating matrix that grades the items in other domains. On the other hand, PTUPCDR outperformed CMF, which illustrates the efficiency of transfer learning using a personalized bridge generated through a meta-network; note that CMF is a transfer learning version of MF. A notable standout is that Item-Agrec clearly outperforms CMF, EMCDR, and PTUPCDR. This can be attributed to Item-Agrec’s advanced neural network model and its introduction of auxiliary item information. FSTDMF achieved the best results in all cases. In a comprehensive review of Table 5, the MAE and RMSE results of FSTDMF improved by 3.7% and 3.3%, 4.4% and 3.9%, and 3.9% and 3.6% compared with DMF, respectively. This indicates that FSTDMF can effectively extract and transfer item features from the auxiliary domain to the target domain, regardless of the sparsity of the training set.

As can be seen in Table 6, CMF, EMCDR, and PTUPCDR are clearly inferior to the other methods. This may be due to the different rating scales of the DoubanMovie and Movielens20M data sets, which may introduce a bias between the features these approaches extract from the auxiliary and target domains. In this case, these methods cannot achieve better results. In a comprehensive review of Table 6, FSTDMF achieves relatively good experimental results in most cases. It is worth noting that, when the training set is 30%, the MAE and RMSE results of FSTDMF improved by 6.5% and 6.5% over DMF. This proves that FSTDMF can efficiently extract item features from auxiliary rating matrices with different rating scales; the subspace projection distance measurement between the auxiliary and target domains can eliminate the influence of different rating scales.

Table 5 Performance of various algorithms on the Movielens1M-Movielens100K data set
Table 6 Performance of various algorithms on the DoubanMovie-Movielens20M data set

In addition, 10 repeated experiments were conducted on the Movielens100k data set to verify the validity of the FSTDMF model by a two-tailed t test with a significance level of 5%. This paper presents two hypotheses:

\(H_0\): the performance difference between FSTDMF and the comparison method is not significant.

\(H_1\): the performance difference between FSTDMF and the comparison method is significant.

When \(H_0\) is rejected and \(H_1\) is accepted, the relative performance of the methods is judged by comparing their means.
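This decision rule can be reproduced with scipy; the sketch below uses a paired two-tailed t test on placeholder MAE values (the paper does not state whether the test is paired, and the numbers below are not results from Table 7):

```python
import numpy as np
from scipy import stats

# MAE over 10 repeated runs for FSTDMF and one compared method
# (placeholder values, not results from the paper).
mae_fstdmf   = np.array([0.712, 0.715, 0.710, 0.714, 0.713,
                         0.711, 0.716, 0.712, 0.714, 0.713])
mae_baseline = np.array([0.740, 0.738, 0.742, 0.741, 0.739,
                         0.743, 0.740, 0.741, 0.738, 0.742])

t_stat, p_value = stats.ttest_rel(mae_fstdmf, mae_baseline)  # two-tailed by default
if p_value < 0.05:
    print(f"p = {p_value:.3g}: reject H0; compare means to rank the methods")
else:
    print(f"p = {p_value:.3g}: fail to reject H0")
```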

The experimental results are shown in Table 7, where ’+’ indicates that the MAE obtained by FSTDMF is superior to that of the compared method with statistical significance under the two-tailed t test. As can be seen from Table 7, FSTDMF was significantly different (\(p<0.05\)) from the other methods in the three different training/test partition cases. Therefore, \(H_0\) is rejected and \(H_1\) is accepted, which shows that FSTDMF differs significantly from the other methods. FSTDMF achieves the best mean in all cases, which proves that FSTDMF is statistically significantly better than the contenders. Note that the experimental phenomena on the Movielens1M and DoubanMovie-Movielens20M datasets are the same as those on Movielens100k; for brevity, the results on these two data sets are not given.

Table 7 Statistical analysis results of various algorithms on the Movielens100k dataset

Parameter sensitivity

To show the effect of different parameters on prediction precision, this study conducts experiments on several data sets. The specific experimental results are presented next.

Trade-off parameter \(\eta \)

The trade-off parameter \(\eta \) is important for the FSTDMF model since it determines how large an impact the auxiliary data has on the target data. To show the effect of \(\eta \) on prediction precision, this work conducts experiments on the Movielens100k data set and the Movielens1M-Movielens100K data set with different values of \(\eta \). For both data sets, 70%, 50%, and 30% were chosen as the training set, with the rest as the test set. The prediction performance is shown in Fig. 5.

As can be seen, the values of MAE and RMSE decrease monotonically as \(\eta \) increases on both data sets, and the FSTDMF model achieves the best result with \(\eta = 1\). This is because the target and auxiliary domains are closely related in these experiments: the item knowledge learned from the auxiliary data makes a positive impact on the target data with a large trade-off parameter. Note that when \(\eta \) decreases to 0, the feature subspace transfer model reduces to the DMF model with the semi-autoencoder, Eq. (4). Hence the prediction precision of FSTDMF decreases greatly when \(\eta \) is close to 0.

Fig. 5 Influence of the parameter \(\eta \) on the FSTDMF prediction error

Number of item feature dimensions

As different models have different sensitivities to the latent feature representations, it is worth examining how many latent feature dimensions are beneficial to the recommendation task. Toward this end, this work conducts experiments on the Movielens1M and DoubanMovie-Movielens20M data sets with different latent feature dimensions. The prediction performance is shown in Fig. 6.

Fig. 6 The influence of different dimensions r on the prediction error of each algorithm on different data sets

As can be seen from Fig. 6a–c, on the Movielens1M data set with a 70% training set, the MAE of PMF increases faster than that of the other methods as the dimension grows. This reflects the high sensitivity of PMF to the dimension and the low sensitivity of deep matrix factorization. Compared with NCF and I-AUTO, DMF shows more stable behavior as the dimension changes and almost always performs better, which shows the superiority of the DMF model. For GraphRec, HCRSA, MFSAE, PRKG, and FSTDMF, the trends are relatively stable, indicating that the introduction of auxiliary information can reduce a model’s sensitivity to the dimension. When the training set is 50% and 30%, the MAE of PMF, DMF, NCF, and I-AUTO shows different trends as the dimension increases, which shows that the size of the training set affects a model’s sensitivity to dimensionality when no auxiliary information is introduced. For GraphRec, HCRSA, MFSAE, PRKG, and FSTDMF, the MAE trend is the same as with a 70% training set, indicating that introducing auxiliary information reduces the influence of training set size on dimensional sensitivity. Overall, FSTDMF is almost always the best across dimensions for the three training set sizes, which illustrates its effectiveness. In addition, FSTDMF shows relatively stable results under different dimensions and training set sizes, indicating that it is almost unaffected by either and has stronger stability.

Table 8 The influence of different layers on FSTDMF prediction error in different data sets
Table 9 The influence of feature subspace transfer and semi-autoencoder on different data sets

In Fig. 6d–f, it can be observed that on the DoubanMovie-Movielens20M dataset, PMF once again exhibits higher sensitivity to dimensionality and sparsity compared to DMF. Similarly, NCF is highly sensitive to both dimensionality and training set size. On the other hand, Item-Agrec shows lower sensitivity to dimensionality and training set size due to its incorporation of autoencoders and auxiliary information. It is important to note that the curves for CMF, EMCDR, and PTUPCDR are not plotted in the figure due to their poor MAE performance. For CMF, when the training set size is 70%, the MAE results are 0.942, 1.261, 1.928, 2.919, 3.471, and 4.2. For EMCDR, the MAE result is 0.862. For PTUPCDR, the MAE results are 0.929, 1.036, 1.136, 1.324, 1.576, and 5.307. These methods exhibit high sensitivity to dimensionality, which is consistent across training set sizes of 50% and 30%. Overall, FSTDMF continues to perform exceptionally well across different dimensions on the DoubanMovie-Movielens20M dataset. Additionally, FSTDMF demonstrates relatively stable results regardless of dimensionality or training set size, indicating greater efficiency and stability compared to the other methods.

Number of layers

The number of layers is one of the key factors affecting the performance of a neural network, so FSTDMF with different numbers of hidden layers is further studied in this paper. Extensive experiments are conducted on the Movielens100k, Movielens1M, and DoubanMovie-Movielens20M datasets to study the proposed model with different numbers of hidden layers. Note that the target ratings of Movielens100k and Movielens1M-Movielens100k are the same and only the auxiliary information differs, so only Movielens100k is used in this experiment. For a detailed comparison, Table 8 shows the performance of different layer counts at different training set ratios.

As can be seen from Table 8, the 2-layer model achieved the best performance on all data sets. This finding is highly encouraging, as it demonstrates the effectiveness of utilizing deep models for collaborative recommendation. The improved performance can be attributed to the increased nonlinearity introduced by stacking multiple nonlinear layers. The relatively poor performance of the 1-layer model suggests that a single-layer neural network may not have sufficient depth to achieve optimal learning results. However, adding more layers beyond two does not appear to yield further improvements; in fact, the 3-layer model even exhibits a decrease in performance. This could be attributed to the higher layers of the network causing overfitting, where the model becomes too specialized to the training data and fails to generalize to unseen data. In summary, the results indicate that a 2-layer model strikes a good balance between capturing nonlinearity and avoiding overfitting, resulting in improved recommendation performance.

Ablation study and execution time analysis

In this section, the contributions of feature subspace transfer (FST) and the semi-autoencoder (SA) are compared using four settings: Movielens100k, Movielens1M-Movielens100k, Movielens1M, and DoubanMovie-Movielens20M. In addition, the execution time of each method’s optimization is analyzed. To ensure fairness, each method’s hyperparameters are kept consistent on each data set. Since there is little difference in execution time under different training proportions, the execution time reported for each algorithm is the average over the three proportions. The experimental results are presented in Table 9.

As can be seen, both DMF (+SA) and DMF (+FST) consistently outperform the standalone DMF across all datasets, indicating their effectiveness. Specifically, on the Movielens100k and Movielens1M-Movielens100k datasets, DMF (+FST) performs better than DMF (+SA), suggesting that feature subspace transfer is particularly advantageous for smaller datasets. On the other hand, for the Movielens1M and DoubanMovie-Movielens20M datasets, although DMF (+FST) improves the performance in terms of MAE and RMSE, the improvement is not significantly higher than that of DMF (+SA). This implies that when dealing with large sparse datasets, initializing latent feature representations using semi-autoencoders can achieve almost the same level of improvement as feature subspace transfer. Taking an overview of these results, FSTDMF consistently achieves the best results across all cases, highlighting the effectiveness of combining FST and SA.

It can be observed that DMF exhibits the lowest execution time across all data sets. FSTDMF has a higher time complexity than DMF due to the additional computation involved in incorporating SA and FST. The execution time of DMF (+SA) is not significantly different from that of DMF, because SA is used only for initializing the latent item features and is not involved in the model optimization process. Notably, the execution time of DMF (+FST) exhibits only a slight increase on the Movielens100k and Movielens1M datasets; on these datasets, the latent features of the items’ auxiliary information are obtained by one-hot encoding, which requires little extra time. In contrast, the execution time increases significantly for Movielens1M-Movielens100k and DoubanMovie-Movielens20M, because extracting features from the auxiliary rating matrices of Movielens1M and DoubanMovie requires additional execution time. In general, the transfer learning methods are more time-consuming than the non-transfer methods for the same reason.

Conclusion

This paper proposes a novel deep matrix factorization method with an item feature subspace transfer model for recommendation systems. This approach surpasses prior methods by innovatively integrating item auxiliary information, optimizing both the feature representations and the learning model’s parameters. In the model framework, this work employed one-hot encoding in tandem with a semi-autoencoder model to extract the latent feature representation of item auxiliary information. Through the utilization of the subspace projection distance, the latent feature representations are seamlessly transferred into the target task. Additionally, this paper leverages the semi-autoencoder to pre-initialize the latent feature representations of items. Extensive experiments conducted on four real-world datasets show that the proposed framework outperforms competing methods in effectiveness.

Although the proposed model can make use of various kinds of item auxiliary information, its ability will be limited when the auxiliary information of the items is scarce. In addition, the model ignores user auxiliary information, which also limits its capability. In future research, the graph features of the items will be considered and further combined with the auxiliary information of users and items, as well as the graph features of users and items, to improve predictive performance on the target task.