Représentations lexicales pour la détection non supervisée d'événements dans un flux de tweets : étude sur des corpus français et anglais [Lexical representations for unsupervised event detection in a stream of tweets: a study on French and English corpora]



arXiv - CS - Information Retrieval. Pub Date: 2020-01-13, DOI: arxiv-2001.04139
Béatrice Mazoyer (MICS), Nicolas Hervé (INA), Céline Hudelot (MICS), Julia Cage (ECON)

In this work, we evaluate the performance of recent text embeddings for the automatic detection of events in a stream of tweets. We model this task as a dynamic clustering problem. Our experiments are conducted on a publicly available corpus of tweets in English and on a similar dataset in French annotated by our team. We show that recent techniques based on deep neural networks (ELMo, Universal Sentence Encoder, BERT, SBERT), although promising in many applications, are poorly suited to this task. We also experiment with different types of fine-tuning to improve these results on French data. Finally, we provide a detailed analysis of the results, showing that tf-idf approaches outperform the neural embeddings on this task.
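To make the "dynamic clustering over tf-idf vectors" framing concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not name the clustering algorithm, so the single-pass nearest-neighbour rule, the `threshold` value, the whitespace tokenizer, and the helper names (`tfidf_vectors`, `cosine`, `cluster_stream`) are all assumptions chosen for illustration. For brevity the idf weights are computed over the whole toy corpus, whereas a real streaming system would have to update them incrementally.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (term -> weight dict) per document.

    Assumption: whitespace tokenization and a smoothed idf,
    idf(t) = ln((1 + n) / (1 + df(t))) + 1, computed over all docs at once.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_stream(docs, threshold=0.3):
    """Assign each document, in arrival order, to an event cluster.

    A document joins the cluster of its most similar predecessor when that
    similarity reaches `threshold` (a hypothetical value); otherwise it
    opens a new cluster, i.e. a candidate new event.
    """
    vectors = tfidf_vectors(docs)
    labels = []
    next_label = 0
    for i, vec in enumerate(vectors):
        best_label, best_sim = None, 0.0
        for j in range(i):  # compare only with earlier documents
            sim = cosine(vec, vectors[j])
            if sim > best_sim:
                best_label, best_sim = labels[j], sim
        if best_label is not None and best_sim >= threshold:
            labels.append(best_label)
        else:
            labels.append(next_label)
            next_label += 1
    return labels


# Toy stream (invented data, not from the paper's corpora).
tweets = [
    "earthquake hits city center tonight",
    "strong earthquake hits the city",
    "new phone launch announced today",
    "phone launch event announced",
]
print(cluster_stream(tweets))  # -> [0, 0, 1, 1]
```

On this toy stream the two earthquake tweets share a cluster and the two phone-launch tweets share another. Replacing `tfidf_vectors` with a function that returns dense sentence embeddings would give the kind of comparison the abstract describes, under the same clustering rule.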


Updated: 2020-01-14
