Abstract

Feature representation learning is a key issue in artificial intelligence research. Multiview multimedia data provide rich information, which has made feature representation one of the current research hotspots in data analysis. Recently, a large number of multiview feature representation methods have been proposed, among which matrix factorization shows excellent performance. We therefore propose an adaptive-weighted multiview deep basis matrix factorization (AMDBMF) method that integrates matrix factorization, deep learning, and view fusion. Specifically, we first perform deep basis matrix factorization on the data of each view. Then, all views are integrated to complete the multiview feature learning procedure. Finally, we propose an adaptive weighting strategy to fuse the low-dimensional features of each view so that a unified feature representation can be obtained for multiview multimedia data. We also design an iterative update algorithm to optimize the objective function and verify the convergence of the optimization algorithm through numerical experiments. We conduct clustering experiments on five multiview multimedia datasets and compare the proposed method with several recent competitive methods. The experimental results demonstrate that the clustering performance of the proposed method is better than that of the compared methods.

1. Introduction

With the rapid development of computer technology, the collected multimedia data from many research fields, such as computer vision, image processing, and natural language processing, always have features with high dimension and complex structures. These high-dimensional data can not only provide abundant information but also bring some problems such as the “curse of dimensionality” [1, 2]. Therefore, how to effectively deal with high-dimensional data has become a widespread concern [3]. Dimensionality reduction is an efficient way to solve this issue, which can map the original data to a low-dimensional space and obtain a low-dimensional representation derived from the hidden information in the original data [4].

In recent years, many dimensionality reduction methods have been proposed for multimedia data [5]. Matrix factorization has become one of the research hotspots owing to its simple theoretical basis and easy implementation. Principal component analysis (PCA) [6], independent component analysis (ICA) [7], vector quantization (VQ) [8], etc. are well-known matrix factorization methods that obtain a low-rank approximation by decomposing a high-dimensional data matrix, and they can effectively extract a low-dimensional representation from high-dimensional data. However, these methods impose no constraints on the matrix elements during decomposition. As a result, the factors may contain negative elements, which causes the low-dimensional representations to lose their physical meaning. To solve this problem, Lee et al. added nonnegative constraints to matrix decomposition and proposed the nonnegative matrix factorization (NMF) [9] method. The low-dimensional feature representations obtained by NMF are part-based and therefore highly interpretable. Consequently, NMF has attracted wide attention from researchers. A large number of improved NMF-based algorithms have emerged and have achieved great success in computer vision, natural language processing, speech recognition, DNA sequence analysis, and other areas [10–13].

NMF decomposes the original nonnegative data matrix into the product of a nonnegative basis matrix and a nonnegative coefficient matrix (also called the low-dimensional feature matrix). Each original sample can be expressed as a linear combination of the basis vectors, and the combination coefficients form the coefficient matrix. Since NMF uses nonnegative constraints, it reflects the intuitive notion of combining parts to form a whole and has better interpretability than other methods. Experimental results indicate that NMF achieves good performance on image and document clustering tasks. Nevertheless, the traditional NMF method only considers the nonnegativity constraints of the elements, which may result in a basis matrix with poor sparseness and independence. To solve these problems, researchers have imposed additional constraints on the basis matrix or the coefficient matrix and proposed a series of improved methods. For instance, Hoyer [14] designed a sparsity measurement criterion and proposed an NMF variant with sparsity constraints (NMF-SC). Moreover, to enhance the independence of the obtained basis matrices and low-dimensional representations, Choi [15] proposed orthogonal nonnegative matrix factorization (ONMF), which imposes orthogonal constraints on the basis matrix and the coefficient matrix. However, the above methods require the original data to be nonnegative, which limits the applicability of these NMF-based algorithms. Therefore, Ding et al. [16] proposed semi-nonnegative matrix factorization (SNMF). Different from traditional NMF, SNMF relaxed the limitations on the original data and coefficient matrix and only imposed a nonnegative constraint on the basis matrix. The methods mentioned above have better feature extraction capabilities than their predecessors and achieve better results in real-world tasks, but they only extract shallow features [17].

In recent years, deep learning has exhibited outstanding performance in feature representation tasks [18–20]. Therefore, many researchers have introduced deep learning into matrix factorization and proposed a large number of deep feature representation methods [21–27]. Ahn et al. [21] proposed multilayer nonnegative matrix factorization (MNMF). Different from traditional NMF-based approaches, MNMF decomposes the coefficient matrix several times to obtain an underlying part-based representation that can extract deep hierarchical features from the original data. In addition, to expand the application scope, Trigeorgis et al. [22] integrated deep factorization and semi-NMF and proposed a deep semi-nonnegative matrix factorization (deep semi-NMF) method. However, both MNMF and deep semi-NMF only consider the deep decomposition of the coefficient matrix for the training data. For new test data, the basis matrix is used to obtain the deep low-dimensional representation. Therefore, the basis matrix directly affects the quality of the deep low-dimensional representation. To obtain a more accurate deep low-dimensional representation of the original data matrix, Zhao et al. [23] applied deep factorization to the basis matrix and proposed a deep NMF method based on basis image learning.

With the rapid development of the Internet and data collection technology, a large amount of multiview multimedia data can be easily acquired [28–30]. For example, an object can be shot from different views, and an image can be described with different types of features such as color, texture, and shape. Each view of such data provides different information, but there are also potential correlations among the views. Consequently, multiview data contain more information than single-view data. It is possible to simply concatenate multiview data into single-view data, but doing so ignores the differences and potential correlations between the various views [28–30].

Consequently, extensive multiview data dimensionality reduction methods have been proposed [3133]. Liu et al. [34] proposed a multiview NMF (multi-NMF) method which established the relationship between different perspectives by learning the common coefficient matrix among different views. Subsequently, Chang et al. [35] introduced a new regularization term into the multi-NMF and used it for clothing image clustering. Inspired by ONMF, Liang et al. [36] proposed NMF with coorthogonal constraints (NMFCC) for multiview multimedia data clustering. Additionally, to consider the correlations between multiple views, Zhan et al. [37] jointly optimized the graph matrix and concept factorization process and proposed an adaptive structure concept factorization (ASCF) method for multiview clustering. Although the above methods can handle multiview multimedia data well, they still belong to the class of feature representation method based on shallow factorization [38, 39]. The underlying deep features in the multiview data are still not available. Therefore, Zhao et al. [40] maximized the mutual information between various views, which forced the nonnegative representation of the last layer in each view to be as similar as possible. Then, the deep semi-NMF method was applied to multiview multimedia data clustering. Different from the existing studies, to adaptively provide feature weights for different perspectives in the multiperspective deep feature representation procedure, Huang et al. introduced an adaptive-weighted framework into the multiview deep semi-NMF and proposed an adaptive-weighted multiview clustering method based on deep matrix factorization [41]. Unlike the literature [40], it can adaptively assign weights to different views in a multiview deep feature representation. However, these methods still consider only the deep decomposition of the coefficient matrix. 
Therefore, an adaptive-weighted multiview deep basis matrix factorization (AMDBMF) method is proposed for multimedia data clustering in this paper. Different from the above methods, AMDBMF first performs deep factorization of the basis matrix on the data of each view simultaneously and then integrates the low-dimensional features of all views through the adaptive weighting mechanism to extract more accurate multiview deep low-dimensional representations. The flowchart of the proposed AMDBMF approach is shown in Figure 1. Finally, we perform extensive experiments on five publicly available multiview multimedia datasets. The experimental results show that the proposed AMDBMF approach outperforms the existing related approaches.

The remainder of this paper is organized as follows. “Related Works” describes the related algorithms including NMF and deep semi-NMF briefly. “Adaptive-Weighted Multiview Deep Basis Matrix Factorization” introduces the adaptive-weighted multiview deep basis matrix factorization (AMDBMF) algorithm in detail. The experimental results and analysis are discussed in “Experiments and Analysis.” Finally, the conclusions are given in “Conclusions and Future Work.”

2. Related Works

2.1. Nonnegative Matrix Factorization

Suppose that the given multimedia data can be represented as $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, where $m$ is the dimensionality of the data and $n$ is the number of samples. Each sample can be represented as an $m$-dimensional feature vector $x_i$. NMF aims to find two low-rank nonnegative matrices $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{r \times n}$ that fulfill $X \approx WH$. After obtaining $W$ and $H$, the original data can be expressed as $x_i \approx W h_i$, that is, each sample can be expressed as a linear combination of the columns of the basis matrix $W$, and the coefficient vector is $h_i$, the $i$-th column of $H$. Therefore, the matrices $W$ and $H$ are called the basis matrix and coefficient matrix, respectively. The objective function of NMF is defined as follows:

$$\min_{W \ge 0,\; H \ge 0} \; \|X - WH\|_F^2, \tag{1}$$

where $\|\cdot\|_F$ is the Frobenius norm.

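According to the Karush-Kuhn-Tucker (KKT) conditions, the update formulas for variables $W$ and $H$ are as follows:

$$W \leftarrow W \odot \frac{XH^{\top}}{WHH^{\top}}, \qquad H \leftarrow H \odot \frac{W^{\top}X}{W^{\top}WH}, \tag{2}$$

where $\odot$ and the fraction bar denote element-wise multiplication and division, respectively.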
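The multiplicative updates for NMF can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the implementation used in the experiments: the function name `nmf`, the rank argument `r`, the random initialization, and the small constant `eps` that guards against division by zero are all choices made here.

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-10, seed=0):
    """Factor a nonnegative matrix X (m x n) into W (m x r) and H (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        # Element-wise multiplicative updates keep W and H nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def frobenius_loss(X, W, H):
    """Squared Frobenius reconstruction error ||X - WH||_F^2."""
    return float(np.linalg.norm(X - W @ H, "fro") ** 2)

# Synthetic nonnegative data with an exact rank-4 structure.
rng = np.random.default_rng(1)
X_demo = rng.random((20, 4)) @ rng.random((4, 30))
W_demo, H_demo = nmf(X_demo, r=4)
```

Because the updates only multiply nonnegative quantities, nonnegativity is preserved without an explicit projection step, which is the main practical appeal of this scheme.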

2.2. Deep Nonnegative Matrix Factorization

The traditional NMF method can remove redundant information and reveal the hidden semantic features of multimedia data, but a single-layer factorization may not learn an effective feature representation for data with complex variations. For example, a facial image contains various changes such as posture, lighting, and expression changes. Therefore, Trigeorgis et al. [22] pointed out that the coefficient matrix, as a low-dimensional representation of high-dimensional data, should be able to continue to be decomposed so that more abstract low-dimensional features can be obtained. Thus, these processes of deep factorization are defined as

$$X \approx W_1 H_1, \quad H_1 \approx W_2 H_2, \quad \ldots, \quad H_{L-1} \approx W_L H_L, \tag{3}$$

where $W_l$ and $H_l$ represent the factorization results of the $l$-th layer. It can be seen from Eq. (3) that deep NMF performs a procedure of matrix factorization at each layer and uses the decomposed coefficient matrix as the input data of the next layer to continue decomposing. Consequently, the process of deep matrix factorization performed on the data is expressed as

$$X \approx W_1 W_2 \cdots W_L H_L. \tag{4}$$

The objective function of deep NMF is defined as follows:

$$\min_{W_l,\, H_l} \; \frac{1}{2}\,\|X - W_1 W_2 \cdots W_L H_L\|_F^2 \quad \text{s.t. } H_l \ge 0,\; l = 1, \ldots, L. \tag{5}$$

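Similar to that of NMF, the update formulas can be defined as follows:

$$W_l = \Psi_{l-1}^{\dagger}\, X\, \tilde{H}_l^{\dagger}, \qquad H_l \leftarrow H_l \odot \sqrt{\frac{[\Psi_l^{\top}X]^{\mathrm{pos}} + [\Psi_l^{\top}\Psi_l]^{\mathrm{neg}} H_l}{[\Psi_l^{\top}X]^{\mathrm{neg}} + [\Psi_l^{\top}\Psi_l]^{\mathrm{pos}} H_l}}, \tag{6}$$

where $\Psi_l = W_1 W_2 \cdots W_l$ (with $\Psi_0$ the identity), $\tilde{H}_l$ denotes the reconstruction of the $l$-th layer's feature matrix, $\dagger$ denotes the Moore-Penrose pseudoinverse, and the symbol $\odot$ represents the element-wise product of matrices. $[A]^{\mathrm{pos}} = (|A| + A)/2$ represents a matrix operation that sets all the negative elements to zero and keeps the positive elements unchanged. Conversely, $[A]^{\mathrm{neg}} = (|A| - A)/2$ sets the positive elements to zero and replaces the negative elements with their absolute values.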
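The positive-part and negative-part operators and the resulting multiplicative update can be illustrated for a single layer, where the product of the preceding basis matrices is the identity. The sketch below follows the semi-NMF update of Ding et al. [16] under assumptions made here: the names `pos`, `neg`, and `semi_nmf`, the random initialization, and the constant `eps` are illustrative, and the basis matrix is obtained with a pseudoinverse.

```python
import numpy as np

def pos(A):
    """Keep positive entries, set negative entries to zero."""
    return (np.abs(A) + A) / 2.0

def neg(A):
    """Set positive entries to zero, replace negative entries by their magnitude."""
    return (np.abs(A) - A) / 2.0

def semi_nmf(X, r, n_iter=100, eps=1e-10, seed=0):
    """Single-layer semi-NMF: X (mixed sign) ~ W H with H >= 0 and W unconstrained."""
    rng = np.random.default_rng(seed)
    H = rng.random((r, X.shape[1])) + 0.1
    for _ in range(n_iter):
        W = X @ np.linalg.pinv(H)            # least-squares basis for the current H
        WtX, WtW = W.T @ X, W.T @ W
        H *= np.sqrt((pos(WtX) + neg(WtW) @ H) / (neg(WtX) + pos(WtW) @ H + eps))
    W = X @ np.linalg.pinv(H)
    return W, H

rng = np.random.default_rng(2)
X_demo = rng.standard_normal((15, 40))       # data with negative entries
W_demo, H_demo = semi_nmf(X_demo, r=3)
```

Splitting each matrix into its positive and negative parts is what allows a multiplicative rule to be used even though the data and the basis may contain negative values.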

3. Adaptive-Weighted Multiview Deep Basis Matrix Factorization

First, an adaptive-weighted multiview deep basis matrix factorization (AMDBMF) method is proposed, which incorporates the nonnegative matrix factorization and deep learning into a unified framework. Next, an optimization algorithm with an iterative updating rule is designed to solve the objective function of AMDBMF. Then, an adaptive-weighted fusion mechanism is provided. Finally, we provide the complexity analysis of the proposed algorithm.

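Suppose that $X = \{X^{(1)}, X^{(2)}, \ldots, X^{(V)}\}$ denotes a multimedia data set which contains $n$ samples. Each sample is described by $V$ views. Thus, the $v$-th view's features for the $i$-th sample can be represented as $x_i^{(v)} \in \mathbb{R}^{m_v}$. The features of all samples in this view can be represented as $X^{(v)} = [x_1^{(v)}, x_2^{(v)}, \ldots, x_n^{(v)}] \in \mathbb{R}^{m_v \times n}$.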

3.1. Objective Function

First, matrix factorization is performed on the features in each view of the multimedia data, and the objective function can be defined as

$$\min_{W^{(v)} \ge 0,\; H^{(v)} \ge 0} \; \|X^{(v)} - W^{(v)} H^{(v)}\|_F^2, \tag{7}$$

where $W^{(v)}$ and $H^{(v)}$ denote the basis matrix and the coefficient matrix of the $v$-th view's features, respectively.

Then, the deep factorization is performed on . The process is defined as follows: where and denote the basis matrices and coefficient matrices for each layer in the -th view, respectively.

By combining it with Eq. (8), Eq. (7) is further rewritten as follows:

Finally, to fuse the data from multiple perspectives, the final objective function is defined as
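The idea of factorizing the basis matrix layer by layer within one view can be sketched as follows. This is a hedged illustration only: it assumes that each layer applies plain NMF to the basis matrix produced by the previous layer, so that the data are approximated by the last basis matrix times the product of all coefficient matrices. The helper names `nmf`, `pretrain_deep_basis`, and `deep_representation` are hypothetical.

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-10, seed=0):
    """Plain NMF with multiplicative updates: X ~ W H, W >= 0, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], r)) + eps
    H = rng.random((r, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def pretrain_deep_basis(X, layer_dims, n_iter=200):
    """Assumed chain: X ~ W1 H1, W1 ~ W2 H2, ..., so the basis is refactored per layer."""
    bases, coeffs = [], []
    current = X
    for layer, r in enumerate(layer_dims):
        W, H = nmf(current, r, n_iter=n_iter, seed=layer)
        bases.append(W)
        coeffs.append(H)
        current = W              # the next layer decomposes the basis matrix
    return bases, coeffs

def deep_representation(coeffs):
    """Product H_L ... H_2 H_1, the low-dimensional features of all samples."""
    rep = coeffs[0]
    for H in coeffs[1:]:
        rep = H @ rep
    return rep

rng = np.random.default_rng(3)
X_view = rng.random((30, 50))                     # one view: 30 features, 50 samples
bases, coeffs = pretrain_deep_basis(X_view, layer_dims=[10, 5])
features = deep_representation(coeffs)            # shape (5, 50)
```

In a multiview setting the same routine would be run once per view, which gives one low-dimensional feature matrix per view for the fusion step.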

3.2. Optimization

From Eq. (10), we can see that the objective function is not jointly convex in all variables, but it is convex in each variable when the others are fixed. Therefore, we design an iterative update algorithm to find a locally optimal solution of the objective function: one variable is updated while the other variables are fixed. The detailed updating rules are described as follows.

The optimal objective function for variables and can be defined as

Let , and then Eq. (11) can be simplified as

The Lagrangian function of Eq. (12) is expressed as where and are Lagrange multipliers.

Taking the partial derivatives of Eq. (13) with respect to and , and setting these derivatives to zero, we have

According to the KKT condition and [41], the update rules of variables and are as follows: where the symbol represents the dot product of matrices.

Finally, the algorithmic steps of the proposed method are given in Algorithm 1. To make it easier to understand, Figure 2 depicts the block diagram of the proposed optimization algorithm.

Input:
Multiview nonnegative matrix
The hidden features of each layer
Initialize:
The basis matrix for each view
The coefficient matrix for each view
Pretrain:
for do
 for do
 end for
 end for
Update:
Repeat
 for do
 for do
 if
 else
 end
 Update
 Update
 end for
 end for
Until Reach the convergence condition or the maximum number of iterations
Output:
The basis matrix for each view
The coefficient matrix for each view

3.3. Feature Fusion

After obtaining the basis matrix and coefficient matrix of each layer for each view through the optimization algorithm, an adaptive-weighted fusion mechanism is adopted to obtain a low-dimensional representation of the multiview data, and the weight calculation is where is a small constant.

Then, is normalized by Eq. (17)

Finally, since the low-dimensional representation of each view is expressed as , the fusion of the low-dimensional features derived from the multiview data can be expressed as
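The three fusion steps (raw weight, normalization, weighted combination) can be sketched as below. The raw weight used here is an assumption made purely for illustration: each view's weight is taken as the reciprocal of its reconstruction error plus a small constant, so that better-reconstructed views contribute more. The function names and the example error values are hypothetical.

```python
import numpy as np

def adaptive_weights(errors, eps=1e-8):
    """Assumed rule: raw weight = 1 / (reconstruction error + eps), then normalize."""
    raw = 1.0 / (np.asarray(errors, dtype=float) + eps)
    return raw / raw.sum()

def fuse_views(representations, weights):
    """Weighted sum of per-view feature matrices that share one shape."""
    return sum(w * H for w, H in zip(weights, representations))

# Two views with hypothetical reconstruction errors 1.0 and 3.0.
weights = adaptive_weights([1.0, 3.0])
view_features = [np.ones((2, 3)), np.zeros((2, 3))]
fused = fuse_views(view_features, weights)
```

The weighted sum requires every view to use the same final-layer dimension, since the per-view feature matrices are added element by element.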

3.4. Complexity Analysis

Clearly, the proposed algorithm can be divided into two stages: pretraining and fine-tuning. For convenience, suppose that the number of iterations is , is the number of data views, and is the number of layers. The number of features for all views is , and the number of low-dimensional representations for each layer is . In the pretraining process, the complexity of a single view is . Therefore, the complexity of the whole pretraining process is . For the fine-tuning part, the main computational complexity is derived from updating , , and , which requires , , and complexity, respectively. Since , the total computational complexity of the proposed algorithm is .

4. Experiments and Analysis

4.1. Datasets

Five commonly used multiview multimedia datasets from the Internet are used in the experiments to verify the effectiveness of the proposed method.

4.1.1. 3sources

This dataset includes a collection of 416 news events and 948 related news reports from February to April 2009 from three well-known news media outlets, including BBC, Reuters, and Guardian. In the experiments, 169 news items reported by all three news media outlets are used. These news events include six categories: business, entertainment, health, politics, sports, and technology (http://mlg.ucd.ie/datasets/3sources.html).

4.1.2. BBC [42]

This dataset contains 685 news articles collected from the BBC News Network between 2004 and 2005. Each article is divided into four parts, and the data consist of five kinds of news: business, entertainment, politics, sport, and technology (http://mlg.ucd.ie/datasets/segment.html).

4.1.3. BBC Sport [42]

This dataset includes 737 news articles from the BBC Sport network from 2004 to 2005. These news articles cover five fields: athletics (track and field), cricket, football, rugby, and tennis (http://mlg.ucd.ie/datasets/segment.html).

4.1.4. Reuters [43]

This is a dataset that includes 1200 English articles from six types of samples, and each article has been translated into French, German, Italian, and Spanish (http://lig-membres.imag.fr/grimal/data.html).

4.1.5. Wikipedia [43]

This dataset consists of specific Wikipedia material with 2669 articles in 29 categories (http://www.svcl.ucsd.edu/projects/crossmodal/). In the experiments, we select a subset of the 10 most popular categories containing a total of 693 samples. The detailed statistical information about the different datasets is given in Table 1.

4.2. Metrics

In the experiments, we select three commonly used clustering evaluation indicators [44]: accuracy (ACC), normalized mutual information (NMI), and purity to evaluate the performance of the proposed method.

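Assuming that the clustering result of sample $x_i$ is $c_i$ and that the corresponding true label is $y_i$, then the clustering accuracy (ACC) [45] is defined as

$$\mathrm{ACC} = \frac{1}{n} \sum_{i=1}^{n} \delta\bigl(y_i, \mathrm{map}(c_i)\bigr),$$

where $n$ is the number of samples and the function $\delta(a, b)$ is defined as follows:

$$\delta(a, b) = \begin{cases} 1, & a = b, \\ 0, & \text{otherwise}. \end{cases}$$

The function $\mathrm{map}(\cdot)$ maps the clustering result to the corresponding true label. The Kuhn-Munkres algorithm [46] is employed to find the best mapping result.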
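Clustering accuracy under the best one-to-one mapping can be illustrated with the short sketch below. Note the substitution: instead of the Kuhn-Munkres algorithm, this example searches all mappings exhaustively, which gives the same optimum but is only practical for a handful of clusters. For larger problems, `scipy.optimize.linear_sum_assignment` solves the same assignment problem in polynomial time. The function name is an assumption.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of samples matched under the best cluster-to-class mapping."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    # Pad with None so that surplus clusters can stay unmatched.
    padding = [None] * max(0, len(clusters) - len(classes))
    best_hits = 0
    for assigned in permutations(classes + padding, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == y for y, c in zip(true_labels, cluster_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)

# Cluster ids are arbitrary, so a pure relabeling still scores 1.0.
acc_relabel = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# One sample of class 0 is placed in the wrong cluster.
acc_one_error = clustering_accuracy([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```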

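Assume that $C$ and $C'$ are the clustering result and the true label set, respectively. The mutual information (MI) between them is defined as

$$\mathrm{MI}(C, C') = \sum_{c_i \in C,\; c'_j \in C'} p(c_i, c'_j) \log \frac{p(c_i, c'_j)}{p(c_i)\, p(c'_j)},$$

where $p(c_i)$ and $p(c'_j)$ represent the probabilities that a sample randomly selected from the dataset belongs to $c_i$ and $c'_j$, respectively, and $p(c_i, c'_j)$ represents the joint probability that a randomly selected sample belongs to both $c_i$ and $c'_j$. Let $H(C)$ and $H(C')$ represent the entropies of $C$ and $C'$, respectively. Since the value range of mutual information is between 0 and $\max(H(C), H(C'))$, the normalized mutual information (NMI) is defined as

$$\mathrm{NMI}(C, C') = \frac{\mathrm{MI}(C, C')}{\max\bigl(H(C), H(C')\bigr)}.$$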
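The mutual information and its normalized form can be computed directly from label counts, as in the sketch below. It assumes normalization by the larger of the two entropies, which is one common convention; other conventions (for example, the arithmetic or geometric mean of the entropies) give different values. The function names are illustrative.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy of a labeling, using natural logarithms."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same samples."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (nij / n) * log((nij / n) / ((count_a[i] / n) * (count_b[j] / n)))
        for (i, j), nij in count_ab.items()
    )

def nmi(a, b):
    """Mutual information divided by the larger entropy (assumed convention)."""
    denom = max(entropy(a), entropy(b))
    return mutual_information(a, b) / denom if denom > 0 else 0.0

nmi_same = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # identical partitions
nmi_indep = nmi([0, 0, 1, 1], [0, 1, 0, 1])     # statistically independent partitions
```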

Purity is a straightforward and transparent evaluation method that is defined as follows: where represents the number of clusters, is the number of elements in the most numerous category in cluster , and is the number of elements in cluster .
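A minimal sketch of purity is given below. It assumes the standard sample-weighted form, in which the sizes of the majority classes of all clusters are summed and divided by the total number of samples; the function name is an assumption.

```python
from collections import Counter, defaultdict

def purity(true_labels, cluster_labels):
    """Sum of majority-class counts over clusters, divided by the sample count."""
    members = defaultdict(list)
    for y, c in zip(true_labels, cluster_labels):
        members[c].append(y)
    majority_total = sum(Counter(ys).most_common(1)[0][1] for ys in members.values())
    return majority_total / len(true_labels)

purity_perfect = purity([0, 0, 1, 1], [5, 5, 7, 7])
purity_mixed = purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

Unlike ACC, purity does not require a one-to-one mapping, so it tends to increase as the number of clusters grows.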

4.3. Experimental Results and Analysis

In the first experiment, to test the influence of the parameters on the proposed method, we set the number of factorization layers and the feature dimension of each layer to and , respectively. Furthermore, we adopt a grid search to find the optimal parameter values. In the experiment, the low-dimensional features obtained by the proposed algorithm are clustered by the $k$-means algorithm. Since the initialization of the $k$-means algorithm has an impact on the clustering results, we repeat the random initialization process 10 times and report the mean value. First, the optimal feature dimension of each layer is fixed, and the number of layers is varied. As shown in Figure 3, in most cases, when the number of layers is set to 1, the result of each measure is poorer than the rest. As the number of layers increases, the performance also increases. This shows that deep factorization helps to improve the performance of the proposed method.
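The evaluation protocol (cluster the learned features with randomly initialized k-means, repeat 10 times, report the mean score) can be sketched as follows. The plain Lloyd's algorithm, the helper names, the synthetic two-blob data, and the simple two-cluster agreement score are all assumptions made for illustration.

```python
import numpy as np

def kmeans(F, k, n_iter=100, seed=0):
    """Lloyd's algorithm on the columns of F (features x samples)."""
    rng = np.random.default_rng(seed)
    pts = F.T
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old center if a cluster becomes empty.
        new_centers = np.array([
            pts[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def mean_score(F, k, score_fn, n_runs=10):
    """Average a clustering score over runs with different random initializations."""
    return float(np.mean([score_fn(kmeans(F, k, seed=s)) for s in range(n_runs)]))

# Synthetic features: two well-separated groups of 20 samples each.
rng = np.random.default_rng(0)
F_demo = np.hstack([rng.normal(0.0, 0.1, size=(2, 20)),
                    rng.normal(5.0, 0.1, size=(2, 20))])
truth = np.array([0] * 20 + [1] * 20)

def two_cluster_agreement(labels):
    """Agreement with the truth up to swapping the two cluster ids."""
    same = float(np.mean(labels == truth))
    return max(same, 1.0 - same)

avg_score = mean_score(F_demo, 2, two_cluster_agreement)
```

Averaging over several seeds reduces the influence of a single unlucky initialization on the reported numbers.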

Then, the numbers of layers are fixed, and the dimension of the feature is changed. The result is shown in Figure 4. It can be seen that as the dimensionality increases, the clustering performance also improves in most cases. However, this trend is not always maintained, and the clustering performance decreases or remains stable as the dimensionality increases once the performance reaches the optimal level. The details of the optimal parameter groups in our proposed algorithm are listed in Table 2.

The second experiment is conducted to verify that the fusion of multiview information is beneficial for improving the clustering performance of the proposed method. First, we perform traditional NMF and deep basis matrix factorization (DBMF) on the data of each view. Then, we obtain the low-dimensional features of the multiview data by fusing the features of different views with equal weights. Finally, the proposed AMDBMF method is compared with the above two methods. The comparison results are listed in Tables 3–5. According to the tables, the performance of the DBMF method is better than that of the traditional NMF method, which indicates that more abstract features can be obtained through deep factorization. The performance of the proposed AMDBMF method is better than that of the DBMF method, which verifies that the adaptive fusion of different views is beneficial for extracting more robust low-dimensional features from multiview data.

The third experiment compares the performance of the proposed AMDBMF method with those of some currently popular multiview algorithms, including MVCF [37], DeepMVC [41], GMC [47], and NMFCC [36]. MVCF utilizes the correlation information between the views obtained by jointly optimizing the graph matrix of the data of each view. DeepMVC uses a nonparameterized adaptive learning method to obtain the weights between views. NMFCC introduces orthogonal constraints into the basis matrix and coefficient matrix. The best results yielded by the different multiview learning methods on different datasets are shown in Tables 6–8. It can be seen that the performance of the proposed method is significantly better than that of the other comparison methods in most cases. Since these methods use different mechanisms to fuse multiview data information, they perform differently on different datasets. Therefore, how to effectively integrate fusion mechanisms is still an open problem.

The final experiment verifies the convergence of the proposed optimization algorithm. The convergence curves of the proposed method on different datasets are given in Figure 5. As seen from the figures, the iterative update rules in Algorithm 1 decrease the objective function value obtained by our proposed method. Moreover, we can also see that our proposed method converges very quickly on these datasets.

5. Conclusions and Future Work

To efficiently learn the feature representations of multiview multimedia data, this paper proposes a new deep nonnegative matrix factorization method with multiview learning. Unlike traditional methods, the proposed method deeply decomposes the basis matrix, so it not only can learn the component representation of the original data but also can learn more abstract deep features. Furthermore, to effectively fuse the available multiview data information, this paper introduces an adaptive feature fusion mechanism.

To address the shortcomings of information fusion for multiview data, a large number of fusion mechanisms have been proposed, and they achieve different performance on different datasets. Therefore, how to effectively integrate different mechanisms to improve the feature representation ability of a given approach is one of the key research tasks to be addressed in the future. Moreover, we will apply our method to other fields such as medical image processing and medical text analysis [48].

Data Availability

The data are derived from public domain resources.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research is supported by the National Natural Science Foundation of China under grant nos. 62062040, 61962026, 62006174, and 71762018, the Chinese Postdoctoral Science Foundation (grant no. 2019M661117), the Provincial Key Research and Development Program of Jiangxi under grant nos. 20192ACBL21031 and 20202BABL202016, the Science and Technology Research Project of Jiangxi Provincial Department of Education (grant nos. GJJ191709 and GJJ191689), Fundamental Research Funds for the Central Universities under grant no. 2412019FZ049, the Graduate Innovation Foundation Project of Jiangxi Normal University under grant no. YJS2020045, and the Young Talent Cultivation Program of Jiangxi Normal University.