Abstract

In the field of information science, helping users quickly and accurately find the information they need among a tremendous amount of short texts has become an urgent problem. Recommendation models are an important way to find such information. However, existing recommendation models have limitations when the items to be recommended are characterized by short texts. To address these issues, this paper proposes a recommendation model based on semantic features and a knowledge graph. More specifically, we first select DBpedia as the knowledge graph to extend the short text features of items and obtain the semantic features of the items from the extended text. We then calculate the item vectors and further derive the semantic similarity degrees of the users. Finally, based on the semantic features of the items and the semantic similarity of the users, we apply collaborative filtering technology to calculate prediction ratings. A series of experiments demonstrate the effectiveness of our model in terms of mean absolute error (MAE) and root mean square error (RMSE) compared with several existing recommendation algorithms. The optimal MAE of the proposed model is 0.6723, and the corresponding RMSE is 0.8442. These promising results show that the recommendation effect of the model in the movie domain is significantly better than those of the existing algorithms.

1. Introduction

With the rapid development of the Internet, smart terminals, and digital resources, recommendation systems are becoming more and more important as a tool for users to obtain the information they need from the network. Moreover, since more and more information takes the form of short texts with few features and incomplete information (e.g., microblogs, short messages, movie introductions, and product reviews), accurately recommending items based on short texts is a hot topic in recommendation systems.

Unfortunately, the existing recommendation systems have some drawbacks for short text recommendation. For example, the traditional collaborative filtering technology mainly considers user ratings and ignores important semantic factors; LDA-based recommendation is effective in processing texts with complete information, but it has drawbacks in processing short texts; neural network-based recommendation requires a lot of computing resources, and its performance is highly related to the quality of the corpus.

To solve the above problems, we propose a recommendation model based on semantic features and knowledge-graph extension for the tremendous number of items with short text characteristics. We first extend the short texts of items based on a knowledge graph and then combine the semantic features of the items with user ratings, making the recommendation results more semantically accurate.

The main contributions of this paper are summarized as follows:
(i) We propose an improved short text feature extension method based on a knowledge graph; it can extend the semantic information of an item. We first select DBpedia as the knowledge graph, then use DBpedia Spotlight to identify the named entities in the short texts of items, and finally obtain extension words from the DBpedia resource pages of the identified entities.
(ii) We propose a recommendation model based on semantic features and a knowledge graph. We first extend the short texts of items based on the knowledge graph, then combine the user semantic similarity and the item semantic similarity according to the semantic features, and further calculate prediction ratings of items.

The remainder of this paper is organized as follows. Section 2 summarizes the related work on the recommendation system. Section 3 presents the proposed method of short text feature extension based on DBpedia. Section 4 presents the proposed recommendation model. In Section 5, experiments are conducted to verify the effectiveness of the proposed model. Finally, Section 6 presents the conclusions of this paper and the future work.

2. Related Work

The research of the recommendation system mainly includes the following categories: (1) the traditional collaborative filtering recommendation algorithms, (2) the recommendation algorithms based on deep learning, and (3) the recommendation algorithms based on the content of the item.

First, traditional collaborative filtering recommendation algorithms include user-based collaborative filtering algorithms [1–5] and item-based collaborative filtering algorithms [6–9]. The idea of the user-based collaborative filtering algorithm is that, when a user needs recommendations, we first find other users who are similar to him and then recommend those users' preferred items to him. The item-based collaborative filtering algorithm instead finds items that are similar to the user's preferred items and recommends them. Because these two collaborative filtering algorithms have some deficiencies, such as bias toward highly popular items and the cold-start problem, scholars have proposed several recommendation algorithms based on hybrid strategies [10–12]. However, traditional recommendation algorithms mainly consider aspects such as ratings and user features and often ignore important semantic features.
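As a rough illustration of the user-based idea (a toy sketch with made-up ratings and our own function name, not the formulation used later in this paper), one can score unseen items by the similarity-weighted ratings of the most similar users:

```python
import numpy as np

def user_based_scores(ratings, target_user, k=2):
    """Score unrated items for one user from the k most similar users.

    ratings: 2-D array, rows = users, cols = items, 0 = unrated.
    """
    # Cosine similarity between the target user's row and every other row.
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[target_user] / (norms * norms[target_user] + 1e-12)
    sims[target_user] = -1.0                      # exclude the user himself
    neighbors = np.argsort(sims)[-k:]             # k nearest neighbors

    # Similarity-weighted average of the neighbors' ratings.
    weights = sims[neighbors]
    scores = weights @ ratings[neighbors] / (weights.sum() + 1e-12)
    scores[ratings[target_user] > 0] = -np.inf    # mask already-rated items
    return scores

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 4, 4]], dtype=float)
scores = user_based_scores(R, target_user=0)
print(int(np.argmax(scores)))   # index of the recommended item -> 2
```

Item-based filtering is symmetric: compute similarities between item columns instead of user rows and aggregate the target user's own ratings.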

Secondly, with the successful application of deep learning in the fields of computer vision and natural language processing, many scholars have introduced deep learning into the recommendation system, using deep feature extraction to improve recommendation performance.

Georgiev proposed using RBMs to extract potential features from user preferences or item ratings in the recommendation field [13]. In addition, DBNs [14] have been used to extract hidden and useful features from audio content for content-based and hybrid music recommendation. Deep learning can effectively extract the features of users, items, and other content through its deep structure and thus enhance recommendation capability. However, implementing a deep learning model requires a lot of computing resources, and its recommendation effect is closely related to the corpus, which brings the challenge of constructing an ideal corpus. Moreover, due to the lack of information in items with short text characteristics, directly applying deep learning methods does not yield ideal results.

Thirdly, some scholars consider recommendation based on the content of the item. This is the continuation and development of collaborative filtering technology. For example, An et al. proposed content-based personalized recommendation of popular microtopics [15]. Some researchers have introduced the LDA model; Zhao et al. proposed a Twitter-LDA model suitable for short texts and gave the topic construction idea of "single microblog, single topic" [16]. Ben-Lhachemi et al. proposed using semantic embedding representations of tweets to help users select relevant hashtags for their posts in real time, capturing the semantic similarity or relevance between tweets so as to realize hashtag recommendation [17]. However, because the short texts of items contain few words, topic modeling on them yields poor and incomplete topics; therefore, the effect of using topics to extend short text features is not ideal either.

Therefore, for items with short text characteristics, since the short text contains little or even no information, directly using the above methods leads to problems such as unsatisfactory recommendation effect or excessively complicated implementation. Obviously, if we want to obtain more accurate recommendation results, we must first effectively extend the items with short text characteristics.

3. The Short Text Feature Extension Method Based on DBpedia

There are two main existing text feature extension approaches applied to short texts: one is based on external documents and search engines [18, 19], and the other is based on an external knowledge base [20–22]. The first approach yields a higher correlation between the original feature words and the extended features, but it is difficult to implement, time-consuming, and dependent to a certain extent on the quality of the results returned by the search engine. The second approach can alleviate the data sparsity problem of short texts, but the extension effect depends on the quality of the external knowledge base, and the amount of calculation is large and time-consuming.

Therefore, considering that the short texts of items often contain entities and that entities have rich meanings, extending from entities is a simple and efficient choice. The DBpedia knowledge graph is chosen as the external knowledge source. On the basis of the Li extension method [23], we propose an improved short text extension method to extend the semantic information of the short texts of items and then integrate the semantic features of the items when recommending.

The extension process of the short texts of items based on DBpedia is shown in Figure 1, and it mainly includes the following tasks. First, we use DBpedia Spotlight [24] to identify entities in the short texts of items and express them as DBpedia entities (noisy entities need to be filtered out), so as to obtain the source feature entity set. Secondly, we further extend these entities based on DBpedia: according to the entity resource pages of DBpedia, the values of the Type attribute of these feature entities are taken as the extended entities of the short texts; we thereby obtain the extended word set of the short texts.

The first task is described in Algorithm 1. We use DBpedia Spotlight, an open-source named entity recognition system, to identify the named entities in the short texts, generate a set of candidate entities, and then disambiguate the candidates. Specifically, by calculating the labeling probability of named entities in Wikipedia [25], candidates whose probability is lower than a certain threshold are deleted. In the disambiguation step, the method proposed by Han and Sun [26] is used to disambiguate the entities.

Input: short texts of items
Output: S - source feature entity set of items
1: S ← ∅
2: Use DBpedia Spotlight to annotate named entity words in the short texts, calculate their tagging probability values, and obtain the set of named entity words {w1, ..., wn} according to a given threshold.
3: for i = 1 to n do
4: According to the Wikipedia page of wi, get the candidate entity set E(wi) for wi.
5: Calculate the probability of each candidate entity in E(wi) being marked as an entity under the current context conditions, and select the entity with the largest probability to add to S.
6: end for
7: return S

In Algorithm 1, we use DBpedia Spotlight to annotate the named entities in the short texts. The threshold of the tagging probability can be set as needed; following the literature [25], we set it to 0.45.
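For readers who want to reproduce the annotation step, a minimal sketch against the public DBpedia Spotlight REST endpoint might look as follows. The endpoint URL and the `Resources`/`@URI` fields reflect the service's documented JSON response shape, but availability and schema may change; the parsing helper is checked offline on a canned payload:

```python
import json
import urllib.parse
import urllib.request

# Public DBpedia Spotlight endpoint; availability and schema may vary.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate(text, confidence=0.45):
    """Return DBpedia entity URIs found in `text` (requires network access)."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    req = urllib.request.Request(
        f"{SPOTLIGHT_URL}?{query}", headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return extract_uris(json.load(resp))

def extract_uris(payload):
    """Pull the entity URIs out of a Spotlight JSON response."""
    return [r["@URI"] for r in payload.get("Resources", [])]

# Offline check against the documented response shape.
sample = {"Resources": [{"@URI": "http://dbpedia.org/resource/Titanic_(1997_film)",
                         "@surfaceForm": "Titanic"}]}
print(extract_uris(sample))
```

The `confidence` parameter plays the role of the tagging probability threshold discussed above.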

The second task is described in Algorithm 2. Using the entity resource pages in DBpedia, we take the Type attribute values of each entity's resource page as candidate extension words for the short text; then, the union of the source feature entity set and the extended word set is taken as the final feature set.

Flisar and Podgorelec [27] studied the Type, Topic, and Category attributes of DBpedia entity resource pages and concluded that the information in the Type attribute is the most effective for entity-based text extension. To avoid introducing ambiguous information, we only consider the Type attribute. This information is obtained from the Infobox of the Wikipedia page when DBpedia is constructed; it is closely related to the entity and can achieve high-quality semantic extension.
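Retrieving the Type attribute of an entity resource page can be sketched with a standard `rdf:type` SPARQL query against the public DBpedia endpoint; the endpoint URL and result format are standard, while the helper names are ours and the network call is only illustrative:

```python
import json
import urllib.parse
import urllib.request

# Public DBpedia SPARQL endpoint; availability is not guaranteed.
SPARQL_URL = "https://dbpedia.org/sparql"

def fetch_types(entity_uri):
    """Query the rdf:type values of a DBpedia resource (requires network)."""
    query = f"SELECT ?t WHERE {{ <{entity_uri}> rdf:type ?t }}"
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(f"{SPARQL_URL}?{params}") as resp:
        return parse_types(json.load(resp))

def parse_types(payload):
    """Extract type URIs from a SPARQL JSON result set."""
    return [b["t"]["value"] for b in payload["results"]["bindings"]]

# Offline check against the standard SPARQL JSON result shape.
sample = {"results": {"bindings": [
    {"t": {"type": "uri", "value": "http://dbpedia.org/ontology/Film"}}]}}
print(parse_types(sample))
```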

When choosing an extended entity, we generally decide by comparing the semantic similarity between entities. There are many ways to measure the semantic similarity of entities; for example, the cosine similarity of the embedding vectors of the entity words can be used, but the accuracy of the embedding vectors depends on the quality of the corpus and training. Considering that the extended entities are derived from the Type attribute of the source feature entities and that these entities sit at different levels of the taxonomy structure of the knowledge graph, it is very suitable to measure the similarity between them through the distance of their semantic path and their depth. We adopt a simple and efficient method [28], and the semantic similarity is calculated according to equation (1) as follows:

Sim(e1, e2) = 1 − Dist(e1, e2)/(2 × DeepMax), (1)

where Dist(e1, e2) represents the number of edges of the shortest path between entity e1 and entity e2 and DeepMax represents the maximum depth of the taxonomy structure.
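Assuming the measure takes the common normalized form Sim = 1 − Dist/(2 × DeepMax) (the exact formula in [28] may differ), it can be sketched over a toy taxonomy with breadth-first search; the edge list below is illustrative, not DBpedia's actual ontology:

```python
from collections import deque

def shortest_path_len(graph, a, b):
    """BFS edge count between two nodes of an undirected taxonomy graph."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("no path")

def sim(graph, a, b, deep_max):
    # Normalized path-based similarity: closer entities score higher.
    return 1 - shortest_path_len(graph, a, b) / (2 * deep_max)

# Toy taxonomy: Thing -> Work -> Film, Thing -> Agent -> Person
edges = [("Thing", "Work"), ("Work", "Film"), ("Thing", "Agent"), ("Agent", "Person")]
graph = {}
for u, v in edges:
    graph.setdefault(u, []).append(v)
    graph.setdefault(v, []).append(u)

print(sim(graph, "Film", "Work", deep_max=2))    # 1 edge apart -> 0.75
print(sim(graph, "Film", "Person", deep_max=2))  # 4 edges apart -> 0.0
```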

Input: S - source feature entity set of items
Output: W - extended words set
1: W ← ∅
2: if S = ∅ then output W
3: for i = 1 to |S| do
4: According to the Type attribute of the resource page of si in DBpedia, obtain the extended word set T(si) of si.
5: Calculate Sim between si and each word in T(si) according to formula (1), determine the extended word set for si according to a given threshold, and add it to W.
6: end for
7: return W

4. Recommendation Model Based on Semantic Features and DBpedia

From Sections 1 and 2, we know that most current recommendation algorithms ignore important semantic factors. Therefore, the recommendation model proposed in this paper takes semantic features into account. As more and more items (such as microblogs, short messages, movie introductions, and product reviews) have short text characteristics, their short texts cannot simply be handled by traditional text processing methods and usually need to be extended first, but how to extend them is itself a difficult problem. In this paper, we use the distance of the semantic path and the depth to measure the similarity between entities, give an improved DBpedia-based method for extending the short texts of items, and then propose a recommendation model based on semantic features (see Figure 2).

4.1. Construct User-Item Rating Matrix

According to the user’s rating of the item, a user-item rating matrix can be constructed.

Definition 1. Assume that, in the dataset, the user set is U (m users) and the item set is I (n items); then, the user-item rating matrix R is defined as a matrix with m rows and n columns, in which each row represents a user, the columns correspond to different items, and the element rui represents the rating given by user u for item i.
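Building such a matrix from (user, item, rating) triples is straightforward; the toy triples below are illustrative, with 0 marking "not rated" as in the MovieLens convention used later:

```python
import numpy as np

def build_rating_matrix(triples, n_users, n_items):
    """Build an m x n user-item rating matrix; 0 marks 'not rated'."""
    R = np.zeros((n_users, n_items))
    for user, item, rating in triples:
        R[user, item] = rating
    return R

# Toy ratings: (user index, item index, rating on a 1-5 scale)
triples = [(0, 0, 5), (0, 2, 3), (1, 1, 4), (2, 0, 1), (2, 2, 5)]
R = build_rating_matrix(triples, n_users=3, n_items=3)
print(R)
```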

4.2. User Semantic Similarity Calculation

The short texts of items are extended based on DBpedia after data cleaning, word splitting, stop word removal, stemming, and other preprocessing; then, we use the word2vec [29] toolkit to process the extended word set, so as to obtain an embedding of a given dimension for each word. We further calculate the vector of the short texts of each item; for simplicity, the average of the word embeddings is used to represent the vector of the item [30].

Definition 2. Given item i, if the word set of its short text after the above extension is W(i) = {w1, w2, ..., wt} and the word embedding of a given dimension obtained by word2vec processing for each wj is v(wj), then the item vector V(i) of i is defined as the average of v(w1), v(w2), ..., v(wt).
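A minimal sketch of this item vector, using a hand-made embedding table in place of word2vec output (the words and vectors are purely illustrative):

```python
import numpy as np

def item_vector(words, embeddings):
    """Average the embeddings of the words that have one (Definition 2 style)."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return np.zeros(next(iter(embeddings.values())).shape)
    return np.mean(vecs, axis=0)

# Toy 3-dimensional embeddings standing in for word2vec output.
emb = {"film": np.array([1.0, 0.0, 0.0]),
       "drama": np.array([0.0, 1.0, 0.0]),
       "ocean": np.array([0.0, 0.0, 1.0])}
v = item_vector(["film", "drama", "unknown"], emb)
print(v)   # out-of-vocabulary "unknown" is skipped -> [0.5, 0.5, 0.0]
```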

Definition 3. Assume that the first L items whose ratings by user u are greater than a given value are selected as q1, q2, ..., qL and that the corresponding item vector of each qi is V(qi); we define the representation vector V(u) of u as the average of V(q1), V(q2), ..., V(qL).

Definition 4. Assume that the representation vectors of users u and v are V(u) and V(v), respectively; we define the semantic similarity between u and v as sim(u, v) = cos(V(u), V(v)).
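The user representation and user semantic similarity can be sketched together as follows; the rating threshold, the number of items kept, and the toy vectors are our own assumptions for illustration:

```python
import numpy as np

def user_vector(item_vectors, ratings, threshold=3, top=5):
    """Average the vectors of the first items the user rated above a threshold."""
    liked = [v for v, r in zip(item_vectors, ratings) if r > threshold][:top]
    return np.mean(liked, axis=0)

def cosine(a, b):
    """Cosine similarity between two user representation vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

items = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
u = user_vector(items, ratings=[5, 2, 4])   # items 0 and 2 pass the threshold
v = user_vector(items, ratings=[4, 1, 5])   # same liked items
print(round(cosine(u, v), 3))   # identical preferences -> 1.0
```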

4.3. Predict the Ratings of Items

In the model proposed in this paper, because some users may tend to give high or low ratings to all items, we introduce the relative difference of ratings into the prediction to make the results more reasonable. Therefore, we use a strategy based on the average of user ratings to calculate the prediction ratings of items. Specifically, according to the user-item rating matrix and the semantic features of the short texts of items, we integrate the user's semantic similarity to calculate user u's prediction rating pui for item i. The formula is as follows:

pui = α × P1(u, i) + β × P2(u, i), (5)

where α and β are the weight coefficients and α + β = 1. P1(u, i) and P2(u, i) are defined as follows:

P1(u, i) = r̄u + (Σ v∈KNN(u) sim(u, v) × (rvi − r̄v)) / (Σ v∈KNN(u) |sim(u, v)|), (6)

P2(u, i) = (Σ j∈KNN(i) cos(V(i), V(j)) × ruj) / (Σ j∈KNN(i) |cos(V(i), V(j))|), (7)

where, in equation (6), sim(u, v) is the semantic similarity of Definition 4, KNN(u) represents the set of the K-nearest neighbors of u (calculated by semantic similarity), r̄u represents the average of the ratings generated by u, rvi represents the rating of v for i (v is a neighbor of u), and r̄v represents the average of the ratings generated by v.

In equation (7), KNN(i) represents the set of the K-nearest neighbors of item i (calculated by semantic similarity), ruj is the rating of u for the neighbor item j, and V(i) and V(j) are the item vectors corresponding to i and j, respectively.
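Under our assumed forms of the two components (a mean-centered user-based term and a cosine-weighted item-based term; the paper's exact equations may differ in detail), the weighted prediction can be sketched on toy data as follows:

```python
import numpy as np

def predict(u, i, R, user_sim, item_vecs, k=2, alpha=0.6, beta=0.4):
    """Blend a user-based and an item-based estimate: alpha*P1 + beta*P2."""
    rated = R > 0                                  # 0 stands for "not rated"
    means = R.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)

    # P1: the user's mean plus mean-centered ratings of the k most similar users.
    sims = user_sim[u].astype(float).copy()
    sims[u] = -np.inf                              # exclude the user himself
    nbrs = [v for v in np.argsort(sims)[-k:] if R[v, i] > 0]
    den = sum(abs(sims[v]) for v in nbrs)
    p1 = means[u] + (sum(sims[v] * (R[v, i] - means[v]) for v in nbrs) / den
                     if den else 0.0)

    # P2: cosine-weighted average of the user's ratings on the k most similar items.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    isims = np.array([cos(item_vecs[i], item_vecs[j]) for j in range(len(item_vecs))])
    isims[i] = -np.inf                             # exclude the item itself
    jnbrs = [j for j in np.argsort(isims)[-k:] if R[u, j] > 0]
    wsum = sum(abs(isims[j]) for j in jnbrs)
    p2 = (sum(isims[j] * R[u, j] for j in jnbrs) / wsum) if wsum else means[u]

    return alpha * p1 + beta * p2

# Toy data: 3 users x 3 items; precomputed user similarities and item vectors.
R = np.array([[5.0, 0.0, 3.0],
              [4.0, 2.0, 3.0],
              [1.0, 5.0, 4.0]])
user_sim = np.array([[1.0, 0.9, 0.1],
                     [0.9, 1.0, 0.2],
                     [0.1, 0.2, 1.0]])
item_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
print(round(predict(0, 1, R, user_sim, item_vecs), 2))
```

The mean-centering in the user-based term is what compensates for users who systematically rate high or low, as motivated above.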

Algorithm 3 gives the specific process of the proposed model.

Input: the user u to be recommended; short texts of items
Output: the prediction rating pui of item i by user u
1: For each item i (i ∈ I), extend the short texts of the item based on DBpedia and obtain the extension feature set W(i).
2: Calculate the item vector of i according to W(i).
3: Construct the user-item rating matrix for the dataset.
4: Construct the user semantic similarity matrix.
5: Select the set KNN(u) composed of the first K users most similar to u.
6: According to formula (5), calculate the prediction rating pui of i by u.

5. Experiment and Result Analysis

The experiment in this paper uses the publicly available MovieLens 1M dataset. MovieLens was developed by the GroupLens project team of the University of Minnesota. It contains 6040 users' ratings of 3952 movies, with a total of 1000209 ratings. Each user rated at least 20 movies (the rating value is an integer between 0 and 5, where 0 means that the user did not rate the movie). The dataset also provides auxiliary information such as the user's occupation, movie category, and movie duration. The sparsity of this dataset is 95.80%.

The movie data (movies.dat) in the dataset includes fields such as MovieID, Title, and Genres, which represent the movie ID, movie name, and movie genre, respectively. In this experiment, the data in the Title and Genres fields is taken as the short text of the corresponding movie item. In addition, the user data (users.dat) is randomly divided into a 90% training set and a 10% test set according to the user ID.

We use Gensim as the word embedding toolkit in the experiment; it is an open-source third-party Python toolkit. It supports a variety of model algorithms, including word2vec, supports streaming training, and provides APIs for common tasks such as similarity calculation and information retrieval.

The hardware environment used in the experiments is as follows: the CPU is an Intel® Core™ i7-8700, and the machine has 16 GB of DDR4 SDRAM, a 4 TB hard drive, and a 128 GB solid-state drive. The software environment is the Windows 10 operating system with Python 3.8 and the Gensim development platform.

5.1. Evaluation Metrics

Since the MovieLens 1M dataset contains the user's rating for each item, we can train and evaluate the prediction ratings in the experiment.

We use mean absolute error (MAE) and root mean square error (RMSE) as the evaluation metrics of the experiment; both are widely used in recommendation systems. These metrics measure the error between the user's actual rating and the prediction rating: the smaller the MAE and RMSE, the smaller the error between the prediction rating and the actual rating and the higher the prediction accuracy of the algorithm. For a user u to be recommended, let T(u) be the set of items in the dataset that u has rated and N be the number of items in T(u). For each i ∈ T(u), let rui denote u's actual rating for i and pui be the prediction rating obtained by the model proposed in this paper. MAE and RMSE are defined as follows:

MAE = (1/N) Σ i∈T(u) |rui − pui|,

RMSE = sqrt((1/N) Σ i∈T(u) (rui − pui)²).
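Both metrics are simple to compute; a small self-contained sketch with illustrative ratings:

```python
import math

def mae(actual, predicted):
    """Mean absolute error between actual and predicted ratings."""
    return sum(abs(r - p) for r, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error; penalizes large errors more than MAE."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(actual, predicted)) / len(actual))

actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 2.5]
print(mae(actual, predicted))    # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
print(rmse(actual, predicted))
```

Note that RMSE is always at least as large as MAE on the same data, with equality only when all errors have the same magnitude.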

5.2. Coefficient Determination

The recommendation model based on semantic features and extension using DBpedia includes the weight coefficients α and β, which appear in equation (5); the values of these parameters affect the quality of the recommendation result.

In equation (5), α and β are weight coefficients and α + β = 1.

The prediction ratings of items combine the users' semantic features. Therefore, we select several typical weight combinations (as listed in Table 1), set the number of nearest neighbors to 20, and calculate the MAE values of the different weight combinations. The results are shown in Figure 3. It can be found that the recommendation effect of the model proposed in this paper is best under the fifth weight combination in Table 1.

5.3. Analysis of Results

The experiment is mainly divided into two parts:
(i) Analyze the recommendation effect of the traditional collaborative filtering technology and the model proposed in this paper on the MovieLens 1M dataset, considering different numbers of nearest neighbors.
(ii) Compare the recommendation effect of the model in this paper with those of some other models. Considering that some of these models are not tested with respect to the number of nearest neighbors, the best experimental results of the various models are selected.

Experiment 1. We compared the recommendation effect of the traditional collaborative filtering recommendation technology (CF) and the model proposed in this paper (SF_EU_DBpedia) at different numbers of nearest neighbors (including 10, 15, 20, 30, 40, and 50). The experiments investigate the impact of the different recommendation models and different numbers of nearest neighbors on MAE and RMSE; the results are shown in Figures 4 and 5.

As shown in Figure 4, the MAE values of the method proposed in this paper (SF_EU_DBpedia) are consistently much better than those of the traditional collaborative filtering recommendation technology (CF) at all numbers of nearest neighbors, and the same holds for the RMSE values in Figure 5.

Experiment 2. In this experiment, we compare the traditional collaborative filtering recommendation technology (CF), collaborative filtering integrating the user-centric natural nearest neighbor (CF3N) [5], enhanced multistage user-based collaborative filtering through nonlinear similarity (EMUCF) [12], and a deep neural network-based recommendation algorithm (DNN) [31] with the method proposed in this paper (SF_EU_DBpedia). Since EMUCF and DNN do not perform experiments based on the number of nearest neighbors, we select the best experimental results of the various algorithms for comparison. The experimental results are shown in Figures 6 and 7.

In Figure 6, the optimal MAE obtained by EMUCF is about 0.7211, while that of the method proposed in this paper is 0.6723. In Figure 7, the optimal RMSE obtained by the DNN is about 0.9631, while that of the method proposed in this paper is 0.8442. The experimental results show that the method proposed in this paper is superior to the abovementioned algorithms in terms of both MAE and RMSE.

6. Conclusion

We propose a recommendation model based on semantic features and a knowledge graph: we first select DBpedia as the knowledge graph to extend the short text features of items and then integrate the semantic features to calculate the prediction ratings for the user to be recommended. Experimental results show that the model proposed in this paper works well.

As future work, we plan to calculate semantic vectors that characterize users with respect to items of different categories, so as to further improve the general applicability of the model. In addition, we can consider more factors when selecting nearest neighbors, such as user semantic similarity and user characteristics; thus, we can choose suitable factors according to the categories of items to achieve more accurate personalized recommendation.

Data Availability

The data included in this paper are available without any restriction.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

We wish to express our appreciation to the reviewers for their helpful suggestions which greatly improved the presentation of this paper. This work was supported by the project of the Guangzhou Science and Technology Bureau, China, under Grant no. 202007040006.