Analyzing the Quality of Twitter Data Streams
Information Systems Frontiers (IF 5.9), Pub Date: 2020-11-04, DOI: 10.1007/s10796-020-10072-x
Franco Arolfo, Kevin Cortés Rodriguez, Alejandro Vaisman

There is a widespread belief that the quality of Twitter data streams is low and unpredictable, which makes it to some extent unreliable to base decisions on such data. The work presented here addresses this problem from a Data Quality (DQ) perspective, adapting the traditional methods used in relational databases, which are based on quality dimensions and metrics, to capture the characteristics of Twitter data streams in particular and of Big Data more generally. As a first contribution, this paper redefines the classic DQ dimensions and metrics for the scenario under study. Second, the paper introduces a software tool that captures Twitter data streams in real time, computes their DQ, and displays the results through a wide variety of graphics. As a third contribution, using this machinery, a thorough analysis of the DQ of Twitter streams is performed, based on four dimensions: Readability, Completeness, Usefulness, and Trustworthiness. These dimensions are studied for several cases: unfiltered data streams, data streams filtered using a collection of keywords, and tweets classified by topic, with the DQ studied for each topic. Further, although it is well known that the number of geolocalized tweets is very low, the paper studies the DQ of tweets with respect to the place from which they are posted. Finally, the tool allows changing the weight of each quality dimension considered in the computation of the overall data quality of a tweet, so that weights can be defined to fit different analysis contexts and/or different user profiles. Interestingly, the study reveals that the quality of Twitter streams is higher than would have been expected.
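The abstract does not give the formula used to combine the four dimensions, so the following is only a minimal sketch of how adjustable weights might enter an overall score. It assumes that each dimension is scored in [0, 1] and that the overall quality is a normalized weighted mean. The function name `overall_quality`, the dictionary keys, and the example scores are all hypothetical and are not taken from the paper or its tool.

```python
from typing import Dict

# The four dimensions named in the abstract; the lowercase keys are illustrative.
DIMENSIONS = ("readability", "completeness", "usefulness", "trustworthiness")


def overall_quality(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return a weighted mean of per-dimension scores.

    Assumption (not from the paper): every score lies in [0, 1] and the
    weights are non-negative, so the result also lies in [0, 1].
    """
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total_weight


# Hypothetical scores for a single tweet.
tweet_scores = {
    "readability": 0.8,
    "completeness": 0.5,
    "usefulness": 1.0,
    "trustworthiness": 0.6,
}

# Two hypothetical profiles: equal weights, and one that emphasizes trust.
equal = {d: 1.0 for d in DIMENSIONS}
trust_heavy = {**equal, "trustworthiness": 3.0}

print(overall_quality(tweet_scores, equal))
print(overall_quality(tweet_scores, trust_heavy))
```

Under these assumptions, raising the weight of a dimension on which the tweet scores below its average (here, trustworthiness) lowers the overall score, which is the kind of context-dependent behavior the adjustable weights are meant to allow.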




Updated: 2020-11-05