Improvements for research data repositories: The case of text spam



Journal of Information Science (IF 2.4), Pub Date: 2021-03-02, DOI: 10.1177/0165551521998636
Ismael Vázquez₁, María Novo-Lourés₂, Reyes Pavón₂, Rosalía Laza₂, José Ramón Méndez₂, David Ruano-Ordás₂


Current research has evolved in such a way that scientists must not only adequately describe the algorithms they introduce and the results of their application, but also ensure that the results can be reproduced and compared with those obtained through other approaches. In this context, public data sets (sometimes shared through repositories) are one of the most important elements for the development of experimental protocols and test benches. This study analysed a significant number of CS/ML (Computer Science/Machine Learning) research data repositories and data sets and detected some limitations that hamper their utility. In particular, we identify and discuss the following functionalities demanded of repositories: (1) building customised data sets for specific research tasks, (2) facilitating the comparison of different techniques that use dissimilar pre-processing methods, (3) ensuring the availability of software applications to reproduce the pre-processing steps without using the repository functionalities and (4) providing protection mechanisms for licensing issues and user rights. To demonstrate these functionalities, we created the STRep (Spam Text Repository) web application, which implements our recommendations adapted to the field of spam text repositories. In addition, we launched an instance of STRep at the URL https://rdata.4spam.group to facilitate understanding of this study.


Updated: 2021-03-03
