Identification of tweets that mention books
International Journal on Digital Libraries, Pub Date: 2019-08-05, DOI: 10.1007/s00799-019-00273-4
Shuntaro Yada, Kyo Kageura, Cecile Paris

We address the task of identifying tweets that mention books from amongst tweets that contain the same strings as book titles. Assuming that a comprehensive list of book titles exists, this task can be defined as text classification over tweets that contain the same string as a book title. In carrying out the task, we need to exclude two types of tweets. The first is automatically posted, spam-like tweets that promote book sales or post recommendations (bot tweets). This type of tweet is excluded because we are developing an online surrogate for book exposure embedded within human communication on social media, and the results of the present task are to be used in this system. The second is tweets that contain the same string as a book title but are not about books (noise tweets). We propose a two-step, machine learning-based pipeline consisting of bot filtering followed by noise reduction. Experimental evaluation showed that the proposed method achieved an F1-score of 0.76, which is comparable to the best performance reported for similar tasks and sufficient as a first step towards practical applications. We also analysed the detailed performance and errors; the analysis suggested that the proposed method maintains an appropriate balance between precision and recall, and that it can be further improved by increasing the data size and taking word senses into account.
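
The control flow of the two-step pipeline described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Tweet` type, its `source` field, and the rule-based `toy_is_bot` and `toy_is_noise` functions are hypothetical stand-ins for the machine learning classifiers the paper trains, and the substring match is one simple reading of "contains the same string as a book title". The `f1_score` helper uses the standard definition of F1 as the harmonic mean of precision and recall.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Tweet:
    text: str
    source: str  # hypothetical field: name of the posting client


def contains_title(tweet: Tweet, titles: Iterable[str]) -> bool:
    """Candidate selection: the tweet contains the same string as some title."""
    lowered = tweet.text.lower()
    return any(title.lower() in lowered for title in titles)


def identify_book_tweets(
    tweets: Iterable[Tweet],
    titles: List[str],
    is_bot: Callable[[Tweet], bool],
    is_noise: Callable[[Tweet], bool],
) -> List[Tweet]:
    """Two-step pipeline: bot filtering followed by noise reduction."""
    candidates = [t for t in tweets if contains_title(t, titles)]
    human = [t for t in candidates if not is_bot(t)]   # step 1: drop bot tweets
    return [t for t in human if not is_noise(t)]       # step 2: drop noise tweets


def f1_score(predicted: List[bool], actual: List[bool]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Stand-in rules for illustration only; the paper uses trained classifiers.
def toy_is_bot(tweet: Tweet) -> bool:
    return tweet.source == "autopost"


def toy_is_noise(tweet: Tweet) -> bool:
    cues = ("read", "novel", "book", "author")
    return not any(cue in tweet.text.lower() for cue in cues)
```

Passing the two classifiers in as callables keeps the pipeline order (bot filtering first, then noise reduction) separate from how each step decides, so trained models could replace the toy rules without changing `identify_book_tweets`.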




Updated: 2019-08-05