A hashtag recommendation system for twitter data streams (Computational Social Networks)



Computational Social Networks, Pub Date: 2016-05-31, DOI: 10.1186/s40649-016-0028-9
Eriko Otsuka, Scott A Wallace, David Chiu


Twitter has evolved into a powerful communication and information sharing tool used by millions of people around the world to post what is happening now. A hashtag, a keyword prefixed with a hash symbol (#), is a Twitter feature for organizing tweets and facilitating effective search among a massive volume of data. In this paper, we propose an automatic hashtag recommendation system that helps users find new hashtags related to their interests on demand. For hashtag ranking, we propose the Hashtag Frequency-Inverse Hashtag Ubiquity (HF-IHU) ranking scheme, a variation of the well-known TF-IDF that considers hashtag relevancy as well as data sparseness, which is one of the key challenges in analyzing microblog data. Our system is built on top of Hadoop, a leading platform for distributed computing, to provide scalable performance using Map-Reduce. Experiments on a large Twitter data set demonstrate that our method successfully yields hashtags relevant to users' interests and that its recommendations are more stable and reliable than ranking tags based on tweet content similarity. Our results show that HF-IHU can achieve over 30% hashtag recall when asked to identify the top 10 relevant hashtags for a particular tweet. Furthermore, our method outperforms kNN, k-popularity, and Naïve Bayes by 69%, 54%, and 17%, respectively, on recall of the top 200 hashtags.
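The abstract does not give the HF-IHU formula, so the following is only a minimal sketch of a TF-IDF-style hashtag ranker built by analogy. It assumes that "hashtag frequency" is the share of a hashtag's co-occurring terms taken up by a given term, and that "inverse hashtag ubiquity" is the log of the number of hashtags divided by the number of hashtags the term co-occurs with. The function names (`build_index`, `rank_hashtags`, `recall_at_k`) and the toy tweets are hypothetical and not taken from the paper, and the sketch runs on a single machine rather than on Hadoop.

```python
import math
from collections import Counter, defaultdict


def build_index(tweets):
    """Index co-occurrences from (terms, hashtags) pairs.

    Returns (term_counts, hashtags_per_term):
      term_counts[tag]        -> Counter of terms seen in tweets carrying tag
      hashtags_per_term[term] -> set of hashtags the term co-occurs with
    """
    term_counts = defaultdict(Counter)
    hashtags_per_term = defaultdict(set)
    for terms, hashtags in tweets:
        for tag in hashtags:
            term_counts[tag].update(terms)
            for term in terms:
                hashtags_per_term[term].add(tag)
    return term_counts, hashtags_per_term


def rank_hashtags(query_terms, term_counts, hashtags_per_term, k=10):
    """Return the k best-scoring hashtags for the query terms.

    ASSUMPTION: score = sum over terms of frequency * inverse ubiquity,
    a TF-IDF analogue; the paper's exact HF-IHU definition may differ.
    """
    n_tags = len(term_counts)
    scores = Counter()
    for tag, counts in term_counts.items():
        total = sum(counts.values())
        for term in query_terms:
            if counts[term] == 0:
                continue  # term never seen with this hashtag
            frequency = counts[term] / total
            # Terms that appear with many hashtags are less informative.
            ubiquity = math.log(n_tags / len(hashtags_per_term[term]))
            scores[tag] += frequency * ubiquity
    return [tag for tag, _ in scores.most_common(k)]


def recall_at_k(recommended, actual, k):
    """Fraction of the tweet's actual hashtags found in the top-k list."""
    actual = set(actual)
    if not actual:
        return 0.0
    return len(set(recommended[:k]) & actual) / len(actual)


# Toy data, invented purely for illustration.
tweets = [
    (["goal", "match", "football"], ["#soccer"]),
    (["goal", "league", "match"], ["#soccer"]),
    (["election", "vote", "poll"], ["#politics"]),
    (["vote", "match"], ["#politics"]),
]
term_counts, hashtags_per_term = build_index(tweets)
top = rank_hashtags(["goal", "match"], term_counts, hashtags_per_term, k=1)
```

In this toy index, "match" co-occurs with both hashtags and therefore contributes nothing to either score, while "goal" co-occurs only with `#soccer`, so `#soccer` ranks first. The `recall_at_k` helper corresponds to the kind of top-k recall the abstract reports, under the assumption that recall is measured against the hashtags a tweet actually carried.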


Updated: 2016-05-31
